
Uncovering Patterns in Tech News with Association Rule Mining

Overview

Association Rule Mining (ARM) is a powerful unsupervised learning technique used to discover interesting relationships or patterns (associations) between items in a dataset. Traditionally used in market basket analysis (e.g., discovering that people who buy "bread" often buy "butter"), ARM is increasingly applied in text mining — where words act as items and documents as transactions.

In this project, we apply ARM to a dataset of news articles related to smartphone chargers. Each article is treated as a transaction, and the preprocessed words from that article represent the items. This approach allows us to uncover how words are contextually associated across multiple articles — helping to surface common product features, consumer concerns, brand mentions, or innovations in charger technology.

Key Concepts and Metrics in ARM

Support Definition: How frequently an itemset appears in the dataset. Formula:

Support(A) = (number of transactions containing A) / (total number of transactions)

Confidence Definition: How often items in B appear in transactions that contain A. Formula:

Confidence(A → B) = Support(A ∪ B) / Support(A)

Lift Definition: How much more likely B is to appear with A than would be expected if A and B were independent. Formula:

Lift(A → B) = Confidence(A → B) / Support(B) = Support(A ∪ B) / (Support(A) × Support(B))
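The three metrics can be computed by hand on a toy example. The transactions below are hypothetical word sets, not the real article data, and serve only to make the formulas concrete:

```python
# Toy "market basket" example: compute support, confidence, and lift by hand.
# Each transaction is a set of items (here, words from hypothetical articles).
transactions = [
    {"samsung", "galaxy", "charger"},
    {"samsung", "galaxy", "wireless"},
    {"charger", "wireless", "upgrade"},
    {"samsung", "charger"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / n

a, b = {"samsung"}, {"galaxy"}
support_ab = support(a | b)           # Support(A ∪ B) = 2/4 = 0.5
confidence = support_ab / support(a)  # Confidence(A → B) = 0.5 / 0.75 ≈ 0.667
lift = confidence / support(b)        # Lift(A → B) ≈ 0.667 / 0.5 ≈ 1.333
print(support_ab, confidence, lift)
```

A lift above 1 (here ≈1.33) means "samsung" and "galaxy" co-occur more often than independence would predict.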

Apriori Algorithm

The Apriori algorithm efficiently finds frequent itemsets using a bottom-up approach, pruning candidate itemsets that fall below the minimum support threshold. Because any superset of an infrequent itemset must itself be infrequent, non-promising candidates are eliminated early, which greatly reduces computational cost.

Data Preparation

Why Transaction Data?
ARM requires transactional, unlabeled data. This means each observation should be a list of items (in our case, words), without any associated class or outcome labels. The goal is to identify how items co-occur across many observations.
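Concretely, the input to ARM is just a list of word lists with no label column attached. A hypothetical example of the expected shape:

```python
# Unlabeled transactional data: each inner list is one document's words.
# There is no class or outcome column; ARM only examines co-occurrence.
transactions = [
    ["samsung", "galaxi", "charger", "fast"],
    ["wireless", "charger", "upgrad"],
    ["samsung", "galaxi", "ultra", "new"],
]

# Every item is a word; every transaction is an article.
assert all(isinstance(word, str) for t in transactions for word in t)
print(len(transactions), "transactions")
```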

Dataset Overview

We used a news dataset containing charger-related articles. Each article includes text under fields like Title, Description, and Content.

Preprocessing Steps

  1. Lowercasing: Convert all text to lowercase.

  2. Tokenization: Split content into words.

  3. Cleaning: Remove stop words, punctuation, numbers, and non-alphabetic tokens.

  4. Deduplication: Remove repeated words in the same document.

  5. Transaction Creation: Treat each cleaned article as a transaction of words.
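The five steps above can be sketched as a single function. This is a minimal illustration using a tiny inline stop-word list and a regex tokenizer; the project itself uses nltk's full stop-word list and tokenizers:

```python
import re

# Minimal stop-word list for illustration; the project uses nltk's full list.
STOPWORDS = {"the", "a", "an", "is", "are", "and", "to", "of", "for", "with"}

def article_to_transaction(text):
    """Turn one article into a transaction: a sorted list of unique, cleaned words."""
    text = text.lower()                    # 1. lowercasing
    tokens = re.findall(r"[a-z]+", text)   # 2. tokenization (alphabetic tokens only)
    cleaned = [w for w in tokens           # 3. cleaning: drop stop words and
               if w not in STOPWORDS and len(w) > 2]  # very short tokens
    return sorted(set(cleaned))            # 4. deduplication within the document

articles = [
    "The new Samsung Galaxy charger is a wireless charger.",
    "An upgrade to wireless charging for the Galaxy.",
]
# 5. transaction creation: one cleaned word list per article
transactions = [article_to_transaction(a) for a in articles]
print(transactions)
```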

Before Transformation

[Screenshot: raw article text before transformation]

After Transformation

[Screenshot: cleaned word transactions after transformation]

Code Summary

The transformation and mining process was implemented in Python using nltk (for NLP), mlxtend (for Apriori), and networkx (for visualization).
 

Processing Pipeline

  • Clean and tokenize articles → remove stopwords → generate list of transactions

  • Encode data using TransactionEncoder

  • Apply Apriori algorithm (min_support = 0.02)

  • Generate rules and compute support, confidence, and lift

  • Visualize with network graphs

We extracted association rules using different thresholds and metrics. The top 15 rules by each of support, confidence, and lift are shown below:

Top 15 Rules by Support

These are the most frequently co-occurring item pairs in the dataset. The most common associations involved the words "samsung", "galaxi", and "charger".

[Screenshot: top 15 rules by support]

These rules indicate common product mentions, such as “Samsung Galaxy” and various charger-related terms.

Top 15 Rules by Confidence

These rules are almost always true when the antecedent occurs. Confidence values of 1.0 show perfect reliability within the dataset.

[Screenshot: top 15 rules by confidence]

Some rules with 100% confidence suggest strong co-occurrence of words describing feature upgrades (e.g., "vent", "wireless", "upgrad") and device specifications.

Top 15 Rules by Lift

Lift reveals the strongest non-random associations, some of which are more corporate or location-based.

[Screenshot: top 15 rules by lift]

These high-lift associations often relate to specific press releases or corporate statements that consistently use the same phrases, such as "divis", "corpor", and "globe".

Network Visualization

[Network graph: top association rules by lift]

This network graph visualizes the top association rules by lift, highlighting strong non-random word pairings in charger-related news. Words like corpor, divis, and north frequently co-occur, suggesting structured phrasing in corporate or press release content. High lift values indicate tight semantic coupling between these terms across multiple articles.

This graph shows the top association rules by support, emphasizing the most frequently co-occurring word pairs across articles. Terms like samsung, galaxi, ultra, and new form a tight cluster, reflecting common product descriptions. The frequent link between charg and charger highlights consistent usage of core terminology in charger-related news.

[Network graph: top association rules by support]
[Network graph: top association rules by confidence]

This graph illustrates the top association rules by confidence, highlighting rules that are nearly always true when the antecedent occurs. Strong directional links toward samsung from terms like screen, review, and smartphon suggest consistent brand references. Another cluster shows vent, wireless, and upgrad, pointing to high-confidence associations around charger features and innovation.
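The network graphs above can be built with networkx by treating each rule as a directed edge from antecedent to consequent, weighted by lift. The rule tuples below are hypothetical placeholders; the real values come from the mined rules:

```python
import networkx as nx

# Toy high-lift rules (antecedent word, consequent word, lift value),
# standing in for the rules mined from the article data.
rules = [
    ("corpor", "divis", 8.2),
    ("divis", "globe", 7.9),
    ("samsung", "galaxi", 3.1),
    ("wireless", "upgrad", 4.5),
]

# Directed graph: antecedent -> consequent, edge weight = lift.
G = nx.DiGraph()
for antecedent, consequent, lift in rules:
    G.add_edge(antecedent, consequent, weight=lift)

# Drawing with matplotlib would follow, e.g.:
# pos = nx.spring_layout(G, seed=42)
# nx.draw_networkx(G, pos, node_color="lightblue")
print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```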

Conclusion

Applying Association Rule Mining to charger-related news articles yielded several insights:

  • Brand Clustering: “Samsung” and “Galaxy” consistently co-occur, showing brand-product tightness in reporting.

  • Feature Innovation: "wireless", "vent", and "upgrade" appear together in high-confidence rules, pointing to innovation narratives in charger products.

  • Corporate Language: High lift values linked words like “corpor”, “divis”, and “globe”, revealing structured vocabulary in press release content.

These insights can aid market trend analysis, content summarization, or even product development by highlighting recurring language around technologies.
