Online News Popularity Analysis — Predicting Article Shares Before Publication

Author

Christopher Legarda

Published

November 17, 2025

1. Summary

This project analyzes the Online News Popularity dataset from Kaggle to create a business-focused report.

I selected this dataset to demonstrate my data analysis skills in business-oriented reporting, as opposed to community or social impact reporting. The report focuses on data-driven insights that support business decisions.

This dataset has over 39,000 online articles; features such as content structure, keyword performance, sentiment, and the engagement score (measured via social media shares) characterize each article.

In this report, I’ll try to convert raw engagement data into actionable business insights by integrating exploratory visualization, feature analysis, and predictive modeling, which could help digital media organizations improve their content planning, publishing, and positioning.

2. Business Problem & Analytical Questions

In digital media, engagement is the main measure of an article's success. Understanding why certain articles “go viral” while others do not can help organizations make informed decisions about advertising strategy and editorial resource allocation.

This project aims to explore the factors that influence online engagement and provide insights into how publishers can increase shareability, visibility, and reader retention.

The report seeks to identify data-driven strategies for improving content performance and to evaluate the possibility of predicting an article’s virality before its publication.

To achieve this, the analysis is guided by six key business questions:

# | Business Question | Focus Area
1 | What types of content drive the most engagement? | Topic & category performance
2 | Does article length affect popularity? | Content structure & readability
3 | How do timing and recency matter? | Publishing schedule optimization
4 | What role do visuals and links play? | Multimedia & SEO strategy
5 | Do sentiment and keywords influence sharing? | Tone, emotion, and keyword strength
6 | Can we predict article popularity before publication? | Predictive modeling for editorial planning

Together, these questions form a data-to-decision framework that mirrors the real-world analytics process — from raw data exploration to business insights and predictive forecasting.

3. Data Overview & Methodology

3.1 Data Source & Description

The dataset used in this analysis is the Online News Popularity dataset, built from articles published by Mashable and made available through the UCI Machine Learning Repository and on Kaggle.

The dataset contains 39,644 rows (online news articles) collected between January 2013 and December 2014. Each article is labeled with metadata, content features, and social media engagement metrics (measured by the total number of shares on platforms like Facebook, LinkedIn, and Google+).

Each row in the dataset represents a single article, and each column captures an attribute describing aspects of its content, publication timing, sentiment, and keyword performance.

This dataset provides an opportunity for business-focused data analysis: identifying engagement patterns that inform editorial strategy, SEO, and content marketing decisions.

3.2 Key Feature Categories

The table below groups the features into categories that reflect editorial and marketing considerations, making the data more interpretable for business stakeholders.

Feature Category | Example Columns | Business Interpretation
Content Length | n_tokens_title, n_tokens_content | Title and article word counts; indicate readability and information depth.
Multimedia Richness | num_imgs, num_videos | Number of images and videos; measure visual engagement level.
SEO & Linking Strategy | num_hrefs, num_self_hrefs | Counts of all hyperlinks and of links to other Mashable articles; reflect how optimized an article is for search and cross-navigation.
Keyword Performance | kw_min_avg, kw_max_avg, kw_avg_avg | Aggregated keyword popularity metrics; proxy for topic relevance and trend appeal.
Sentiment & Tone | global_sentiment_polarity, global_subjectivity, title_sentiment_polarity, title_subjectivity | Emotional tone and objectivity of both the article and its title.
Timing Features | weekday_is_*, is_weekend, timedelta | Publication day and recency; reveal engagement patterns over time.
Engagement Outcome | shares | The number of times an article was shared; the main measure of popularity and performance.

3.3 Analytical Approach

This report follows a structured progression from finding patterns to providing actionable information. The analytical techniques below, all common in business analytics, are applied in this report:

  1. Descriptive Analytics: Examine historical data to identify how content type, sentiment, visuals, and timing relate to engagement.

  2. Diagnostic Analytics: Explore why certain articles perform better using correlation analysis, segmentation, and visual storytelling.

  3. Predictive Analytics: Build a simple model to evaluate whether pre-publication article features can predict shareability or “virality.”

I conducted the analysis in Python using Jupyter Notebook, with ‘pandas’ and ‘numpy’ for data manipulation, ‘matplotlib’ and ‘seaborn’ for visualization, and ‘scikit-learn’ for statistical modeling and prediction.

Since this report targets non-technical and business audiences, it emphasizes the findings and insights rather than the code implementation itself.

Before analysis, I performed the data preparation steps (handling missing values, cleaning, and transformations) to ensure data accuracy and consistency.
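The report does not list the exact preparation steps, so the following is only a minimal sketch of what such a pass could look like. The helper name `prepare`, the sample rows, and the assumption that raw column names carry leading spaces are all illustrative and not taken from the original notebook.

```python
import pandas as pd

# Made-up rows standing in for the raw CSV; the leading spaces in the
# column names are an assumption about how the file is distributed.
raw = pd.DataFrame({
    "url": ["a", "b", "b", "c"],
    " n_tokens_content": [219.0, 255.0, 255.0, None],
    " shares": [593, 711, 711, 1500],
})

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning helper: tidy names, drop duplicates and gaps."""
    out = df.copy()
    out.columns = out.columns.str.strip()   # " shares" -> "shares"
    out = out.drop_duplicates()             # remove exact repeat rows
    out = out.dropna(subset=["n_tokens_content", "shares"])
    return out.reset_index(drop=True)

clean = prepare(raw)
print(clean)
```

Dropping rows with missing values is only one option; imputing them may be more appropriate depending on how many rows are affected.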

4. Exploratory Data Analysis (EDA & Insights)

This section is the core of the report. It works through the business-driven analytical questions listed above. Each subsection below presents visuals, descriptive statistics, and summarized insights.

Question 1 — What types of content drive the most engagement?

Business Context:

Understanding which content categories perform best helps editorial teams prioritize topics that consistently attract higher engagement.

This analysis examines which Mashable sections (e.g., Lifestyle, Business, Technology, etc.) receive the most shares.

Approach:

  • Group articles by channel (e.g., Lifestyle, Business, Tech, etc.)
  • Compare their median and distribution of shares
  • Visualize with bar and box plots to identify engagement patterns
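The grouping step above can be sketched as follows. The `data_channel_is_*` column names are assumed one-hot channel flags (the report does not name them), the share values are made up, and `label_channel` is a hypothetical helper. Rows with no flag set are treated as "uncategorized".

```python
import pandas as pd

# Assumed one-hot channel columns; only three are shown for brevity.
channel_cols = [
    "data_channel_is_lifestyle",
    "data_channel_is_socmed",
    "data_channel_is_tech",
]

# Made-up sample; the fifth row has no channel flag set.
articles = pd.DataFrame({
    "data_channel_is_lifestyle": [1, 0, 0, 0, 0, 1],
    "data_channel_is_socmed":    [0, 1, 1, 0, 0, 0],
    "data_channel_is_tech":      [0, 0, 0, 1, 0, 0],
    "shares": [1700, 2100, 2500, 1800, 1900, 1500],
})

def label_channel(df: pd.DataFrame, cols: list) -> pd.Series:
    """Collapse one-hot flags into a single channel label per article."""
    has_channel = df[cols].sum(axis=1) > 0
    label = df[cols].idxmax(axis=1).str.replace("data_channel_is_", "", regex=False)
    # idxmax returns the first column for all-zero rows, so overwrite those.
    return label.where(has_channel, "uncategorized")

articles["channel"] = label_channel(articles, channel_cols)
median_by_channel = (
    articles.groupby("channel")["shares"].median().sort_values(ascending=False)
)
print(median_by_channel)
```

The median is used rather than the mean because share counts are heavily right-skewed, so a few viral articles would otherwise dominate each group.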

Insight Summary:

Social Media articles drive the highest engagement, with a median of ~2,100 shares and strong viral upside. Uncategorized articles come next, with a median of ~1,900 shares. Lifestyle and Technology follow closely, offering reliable performance. Business and World articles generate lower engagement, although their share counts are more stable. To maximize reach, editorial resources should be weighted toward Social Media, Uncategorized, Lifestyle, and Tech.

Question 2 — Does article length affect popularity?

Business Context:
Article length can influence how readers engage with certain content. Shorter articles may seem attractive to people who want to have a quick read, while longer articles can deliver more depth and detail.

This analysis will attempt to determine whether title length and content length (measured by estimated reading time) influence how often people share an article.

Approach:

  • Analyze the distribution of n_tokens_title and n_tokens_content to understand writing patterns.
  • Convert content length into estimated reading time, assuming an average reading speed of 200 words per minute.
  • Group articles into title length and reading time bins, then calculate median shares within each group to detect engagement trends.
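As a rough illustration of the reading-time conversion and binning described above, the sketch below divides `n_tokens_content` by 200 words per minute and groups the result into bins. The bin edges, labels, and sample values are assumptions chosen for the example, not the ones used in the report.

```python
import pandas as pd

WORDS_PER_MINUTE = 200  # reading speed stated in the approach

# Made-up sample of article lengths and share counts.
articles = pd.DataFrame({
    "n_tokens_content": [150, 380, 900, 1600, 4400, 5000],
    "shares": [900, 1100, 1400, 1500, 2600, 3000],
})

# Estimated reading time in minutes.
articles["reading_min"] = articles["n_tokens_content"] / WORDS_PER_MINUTE

# Illustrative bins: left-closed intervals [0, 2), [2, 5), ..., [20, inf).
bins = [0, 2, 5, 10, 20, float("inf")]
labels = ["<2", "2-5", "5-10", "10-20", "20+"]
articles["reading_bin"] = pd.cut(
    articles["reading_min"], bins=bins, labels=labels, right=False
)

# Median shares per bin; observed=True skips bins with no articles.
median_by_bin = articles.groupby("reading_bin", observed=True)["shares"].median()
print(median_by_bin)
```

At 200 words per minute, a twenty-minute read corresponds to about 4,000 words, which is why the "20+" bin contains only the longest articles.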

Insight Summary:

The first chart shows that title lengths follow a roughly normal distribution centered on ten words, indicating that most editors in the dataset write titles of about that length. Content length, by contrast, is highly right-skewed: most articles are relatively short, under 2,000 words, and only a tiny fraction extend beyond that.

Comparing the two engagement bar charts, Median Shares by Title Length shows that titles with sixteen or more words generated the highest median shares, suggesting that longer titles may attract readers’ attention and encourage sharing. Similarly, Median Shares by Content Length (Reading Time) shows that articles taking over twenty minutes to read achieved the highest median share counts. This suggests that readers are more likely to share longer, more detailed articles, possibly because these articles convey unique insights. However, since such long articles are relatively rare, the results likely reflect the higher quality and editorial investment in these pieces rather than length alone.

Question 3 — How do timing and recency matter?

Business Context:

In news media, timing affects visibility, competition, and user behavior. Identifying when engagement peaks can help optimize content scheduling.

Approach:

  • Analyze weekday_is_* and is_weekend variables
  • Compare median shares by weekday and weekend
  • Evaluate timedelta (days since publication) to understand recency impact
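The weekday comparison above can be sketched like this. The sample rows are made up, only four of the seven `weekday_is_*` columns are shown, and the weekend flag is derived from the day label here rather than read from the dataset's own `is_weekend` column.

```python
import pandas as pd

# Made-up sample with one-hot weekday flags (subset of days for brevity).
articles = pd.DataFrame({
    "weekday_is_monday":   [1, 1, 0, 0, 0, 0],
    "weekday_is_friday":   [0, 0, 1, 0, 0, 0],
    "weekday_is_saturday": [0, 0, 0, 1, 1, 0],
    "weekday_is_sunday":   [0, 0, 0, 0, 0, 1],
    "shares": [1300, 1500, 1500, 1900, 2100, 1900],
})

day_cols = [c for c in articles.columns if c.startswith("weekday_is_")]

# Recover a single weekday label from the one-hot flags.
articles["weekday"] = (
    articles[day_cols].idxmax(axis=1).str.replace("weekday_is_", "", regex=False)
)

# Label each article as weekday or weekend for the two-group comparison.
articles["period"] = (
    articles["weekday"].isin(["saturday", "sunday"])
    .map({True: "weekend", False: "weekday"})
)

median_by_day = articles.groupby("weekday")["shares"].median()
median_by_period = articles.groupby("period")["shares"].median()
print(median_by_day)
print(median_by_period)
```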

Insight Summary:

The analysis shows a clear temporal pattern in audience engagement. Articles published on the weekends significantly outperform those released during the weekdays, with Saturday reaching the highest median share count (around 2,000) and Sunday following closely behind.

In Median Shares: Weekday vs. Weekend, weekday articles have median shares between 1,300 and 1,500, compared with 1,800 to 2,000 for weekend articles. This pattern implies that readers are more likely to discover and share content during the weekend than during the working week.

For editorial teams, these findings suggest that scheduling major stories or feature articles for weekend publication could improve shareability and overall reach.

Question 5 — Do sentiment and keywords influence sharing?

Business Context:

This question examines how an article’s emotional tone and its keywords relate to online engagement.

Articles that strike the right emotional balance, or that align with current trends through high-performing keywords, are more likely to attract attention and sharing.

This analysis explores how sentiment polarity, subjectivity, and keyword strength contribute to article popularity.

Approach:

  • Evaluate sentiment variables (global_sentiment_polarity, title_sentiment_polarity, global_subjectivity, title_subjectivity) to measure tone and emotional intensity.
  • Examine keyword-based metrics (kw_min_avg, kw_max_avg, kw_avg_avg) that represent how popular or widely shared those keywords are across the entire dataset.
  • Compare median shares across binned sentiment and keyword quartiles to identify which emotional tones and topic strengths are most associated with engagement.
              count         mean          std  min          25%          50%          75%            max
kw_min_avg  39644.0  1117.146610  1137.456951 -1.0     0.000000  1023.635611  2056.781032    3613.039819
kw_max_avg  39644.0  5657.211151  6098.871957  0.0  3562.101631  4355.688836  6019.953968  298400.000000
kw_avg_avg  39644.0  3135.858639  1318.150397  0.0  2382.448566  2870.074878  3600.229564   43567.659946
Spearman correlation between kw_min_avg and shares: 0.103
Spearman correlation between kw_max_avg and shares: 0.223
Spearman correlation between kw_avg_avg and shares: 0.256
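For readers who want to see how figures like these could be produced, the sketch below computes a Spearman correlation as the Pearson correlation of ranks and then compares median shares across keyword quartiles. The sample values and the `spearman` helper are illustrative assumptions; the report's own numbers come from the full dataset.

```python
import pandas as pd

# Made-up sample; the real analysis uses all 39,644 articles.
articles = pd.DataFrame({
    "kw_avg_avg": [1800, 2400, 2600, 2900, 3100, 3600, 4200, 5000],
    "shares":     [700, 900, 1500, 1200, 1400, 2500, 1900, 9000],
})

def spearman(x: pd.Series, y: pd.Series) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return x.rank().corr(y.rank())

rho = spearman(articles["kw_avg_avg"], articles["shares"])

# Split articles into keyword-popularity quartiles and compare medians.
articles["kw_quartile"] = pd.qcut(
    articles["kw_avg_avg"], q=4, labels=["Q1", "Q2", "Q3", "Q4"]
)
median_by_quartile = (
    articles.groupby("kw_quartile", observed=True)["shares"].median()
)

print(round(rho, 3))
print(median_by_quartile)
```

A rank-based correlation suits this data because share counts are skewed by a few extreme values, such as the 9,000-share row in the sample.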

Insight Summary:

The analysis indicates that both emotional tone and keyword popularity are associated with shareability. The keyword metrics show weak to moderate rank correlations with shares (Spearman 0.10 to 0.26).

Articles with a positive sentiment polarity, especially those written with optimistic or emotionally expressive language, receive more shares. Similarly, subjective writing, where authors present opinions or emotional perspectives, also correlates with higher engagement, likely because such content feels more relatable and authentic to readers.

When examining keyword metrics, two variables stood out as the most influential:

  • kw_avg_avg — representing the average popularity of all keywords in an article, reflecting overall topic appeal.

  • kw_max_avg — capturing the strongest or trendiest keyword within an article, showing viral potential tied to a single trending topic.

Articles that scored high on both measures saw significantly more shares, demonstrating that using popular or high-traffic topics boosts visibility.

Since these variables showed the strongest and most stable correlations with article shares, I selected them as core predictive features for the modeling phase in Question 6.

In summary, positive tone, expressive style, and keyword relevance are key elements of viral content.

This finding bridges the exploratory analysis to predictive modeling, providing a data-driven rationale for which factors best forecast article success.

Question 6 — Can we predict article popularity before publication?

Business Context:

If editors could estimate the popularity of an article before publishing, they could strategically allocate resources by prioritizing high-impact stories, optimizing headlines, and tailoring content strategy around engagement potential.

To analyze this problem, I framed it as a predictive analytics task, testing whether metadata and pre-publication features (like sentiment, keywords, and structure) can meaningfully forecast share counts.

Approach:

  • Model 1: Regression — Predict the continuous number of shares using Random Forest Regressor.
  • Model 2: Classification — Label each article as viral or non-viral using the median shares threshold, then predict the class with a Random Forest Classifier.
  • Evaluate model performance using key metrics:
    • Regression: R², MAE, RMSE
    • Classification: Accuracy, F1-score, and ROC-AUC
  • Analyze feature importance to identify which article attributes most influence shareability.

Regression metrics:

Metric | Value
R² | 0.019
MAE | 3,030.389
RMSE | 10,691.588

Classification metrics:

Metric | Value
Accuracy | 0.642
F1 Score | 0.634
ROC-AUC | 0.686

Confusion matrix:

 | Predicted: Non-Viral | Predicted: Viral
Actual: Non-Viral | 3291 | 1730
Actual: Viral | 1815 | 3075
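
The classification approach can be sketched roughly as below. This is not the original notebook code: the data is synthetic, the three feature columns are a small illustrative subset, the hyperparameters are arbitrary, and computing the median threshold on the training split is an assumption about how the labels were built.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the article features (values are made up).
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "kw_avg_avg": rng.normal(3000, 1000, n),
    "n_tokens_content": rng.integers(100, 3000, n),
    "global_sentiment_polarity": rng.uniform(-0.3, 0.6, n),
})
# Synthetic share counts loosely tied to keyword popularity plus noise.
shares = np.exp(6.5 + 0.0004 * X["kw_avg_avg"] + rng.normal(0, 1.0, n))
shares = shares.rename("shares")

X_train, X_test, shares_train, shares_test = train_test_split(
    X, shares, test_size=0.25, random_state=0
)

# Label articles as viral (1) or non-viral (0) using the training median.
threshold = shares_train.median()
y_train = (shares_train > threshold).astype(int)
y_test = (shares_test > threshold).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

pred = model.predict(X_test)
proba = model.predict_proba(X_test)[:, 1]  # probability of the viral class

metrics = {
    "accuracy": accuracy_score(y_test, pred),
    "f1": f1_score(y_test, pred),
    "roc_auc": roc_auc_score(y_test, proba),
}
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(metrics)
print(importances)
```

Deriving the threshold from the training split keeps information from the test articles out of the labels; whether the original analysis did this is not stated.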

Insight Summary:

The regression model achieved an R² of 0.019, with an MAE around 3,030 and an RMSE near 10,692, showing that predicting exact share counts is highly uncertain. Audience engagement remains influenced by external and unpredictable factors (like news cycles and platform dynamics).

However, the classification model performed more reliably, reaching ~64% accuracy, F1 = 0.63, and ROC-AUC = 0.69, meaning it can correctly distinguish “viral” from “non-viral” articles more often than random chance. The confusion matrix shows errors spread fairly evenly across the two classes, indicating that the model is not biased toward either one.
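
As a quick arithmetic check, the accuracy and F1 score can be recomputed from the four confusion matrix counts reported above. The sketch assumes "viral" is the positive class, which is consistent with the reported F1 value.

```python
# Counts taken from the confusion matrix in this section.
tn, fp = 3291, 1730   # actual non-viral: predicted non-viral / viral
fn, tp = 1815, 3075   # actual viral:     predicted non-viral / viral

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted-viral that were viral
recall = tp / (tp + fn)      # share of actual-viral that were caught
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3))  # 0.642
print(round(f1, 3))        # 0.634
```

Precision (about 0.64) and recall (about 0.63) are close to each other, which matches the observation that errors are spread evenly between the two classes.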

Feature importance analysis highlights that keyword popularity metrics (kw_avg_avg, kw_max_avg), content length, and sentiment polarity are the most predictive variables. These features together capture both what the article is about and how it communicates — two key drivers of audience engagement.

Overall, while predicting the exact number of shares remains difficult, predictive classification offers valuable editorial guidance. Editors can use such models to assess potential virality before publication, enabling data-driven decisions in content planning, promotion timing, and SEO optimization.

Key Predictive Features Identified

Rank Feature Interpretation
1 kw_avg_avg Average popularity of used keywords; high values indicate topics readers already engage with.
2 kw_max_avg Strongest keyword appeal; articles with trending keywords attract more attention.
3 n_tokens_content Longer, in-depth articles tend to get shared more.
4 global_sentiment_polarity Positive tone enhances reader resonance.
5 num_hrefs Moderate external linking boosts credibility and engagement.

Conclusion:

Even basic machine learning models can show which pre-publication attributes are associated with sharing. News organizations can use those signals to refine their content before publishing.