Online News Popularity Analysis — Predicting Article Shares Before Publication
Author
Christopher Legarda
Published
November 17, 2025
1. Summary
This project analyzes the Online News Popularity dataset from Kaggle to create a business-focused report.
I selected this dataset to demonstrate my data analysis skills in business-oriented reporting, as opposed to community- or social-impact reporting; the report focuses on data-driven insights that support business decisions.
This dataset has over 39,000 online articles; features such as content structure, keyword performance, sentiment, and the engagement score (measured via social media shares) characterize each article.
In this report, I’ll try to convert raw engagement data into actionable business insights by integrating exploratory visualization, feature analysis, and predictive modeling, which could help digital media organizations improve their content planning, publishing, and positioning.
2. Business Problem & Analytical Questions
In digital media, article engagement is the main metric of its success. Understanding why certain articles “go viral” while others do not can help organizations make informed decisions and plan advertising strategies and editorial resource allocation.
This project aims to explore the factors that influence online engagement and provide insights into how publishers can increase shareability, visibility, and reader retention.
The report seeks to identify data-driven strategies for improving content performance and to evaluate the possibility of predicting an article’s virality before its publication.
To achieve this, the analysis is guided by six key business questions:
| # | Business Question | Focus Area |
|---|-------------------|------------|
| 1 | What types of content drive the most engagement? | Topic & category performance |
| 2 | Does article length affect popularity? | Content structure & readability |
| 3 | How do timing and recency matter? | Publishing schedule optimization |
| 4 | What role do visuals and links play? | Multimedia & SEO strategy |
| 5 | Do sentiment and keywords influence sharing? | Tone, emotion, and keyword strength |
| 6 | Can we predict article popularity before publication? | Predictive modeling for editorial planning |
Together, these questions form a data-to-decision framework that mirrors the real-world analytics process — from raw data exploration to business insights and predictive forecasting.
3. Data Overview & Methodology
3.1 Data Source & Description
The dataset used in this analysis is the Online News Popularity dataset, originally published by Mashable and made available through the UCI Machine Learning Repository and on Kaggle.
The dataset contains 39,644 rows (online news articles) collected between January 2013 and December 2014, each labeled with metadata, content features, and social media engagement metrics (measured by the total number of shares on platforms such as Facebook, LinkedIn, and Google+).
Each row in the dataset represents a single article, and each column captures an attribute describing aspects of its content, publication timing, sentiment, and keyword performance.
This dataset provides an opportunity for business-focused data analysis: identifying engagement patterns that inform editorial strategy, SEO optimization, and content marketing decisions.
3.2 Key Feature Categories
The table below groups the features into meaningful categories that reflect editorial and marketing considerations, making the data more interpretable for business stakeholders.
| Feature Category | Example Columns | Business Interpretation |
|------------------|-----------------|--------------------------|
| Content Length | `n_tokens_title`, `n_tokens_content` | Title and article word counts — indicate readability and information depth. |
| Multimedia Richness | `num_imgs`, `num_videos` | Number of images and videos — measure visual engagement level. |
| SEO & Linking Strategy | `num_hrefs`, `num_self_hrefs` | Counts of external and internal hyperlinks — reflect how optimized an article is for search and cross-navigation. |
| Keyword Performance | `kw_min_avg`, `kw_max_avg`, `kw_avg_avg` | Aggregated keyword popularity metrics — proxy for topic relevance and trend appeal. |
| Sentiment & Tone | `global_sentiment_polarity`, `global_subjectivity`, `title_sentiment_polarity`, `title_subjectivity` | Emotional tone and objectivity of both the article and its title. |
| Timing Features | `weekday_is_*`, `is_weekend`, `timedelta` | Publication day and recency — reveal engagement patterns over time. |
| Engagement Outcome | `shares` | The number of times an article was shared — the main measure of popularity and performance. |
3.3 Analytical Approach
This report follows a structured path from pattern discovery to actionable insight, using three analytical techniques common in business analytics:

1. Descriptive Analytics – Examine historical data to identify how content type, sentiment, visuals, and timing relate to engagement.
2. Diagnostic Analytics – Explore why certain articles perform better using correlation analysis, segmentation, and visual storytelling.
3. Predictive Analytics – Build a simple model to evaluate whether pre-publication article features can predict shareability or “virality.”
I conducted the analysis in Python using Jupyter Notebook, with `pandas` and `numpy` for data manipulation, `matplotlib` and `seaborn` for visualization, and `scikit-learn` for statistical modeling and prediction.
Since this report targets non-technical and business audiences, it emphasizes the findings and insights rather than the code implementation itself.
Before the analysis, I performed all data preparation steps (handling missing values, cleaning, and transformations) to ensure data accuracy and consistency.
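As a minimal sketch on a toy frame, the preparation steps amount to the following; the raw CSV's column names carry stray leading spaces (e.g. `" shares"`), so normalizing them comes first:

```python
import pandas as pd

# Toy frame standing in for the raw CSV; real column names have leading spaces.
df = pd.DataFrame({" shares": [100, 200, 200], " n_tokens_title": [10, 12, 12]})

df.columns = df.columns.str.strip()  # normalize column names
df = df.drop_duplicates()            # remove full-row duplicates
missing = df.isna().sum()            # audit missing values per column

print(list(df.columns), len(df), int(missing.sum()))
```

The same three steps (strip, deduplicate, audit missingness) apply unchanged to the full 39,644-row dataset.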
4. Exploratory Data Analysis (EDA & Insights)
This section forms the core of the report, working through the six business-driven analytical questions introduced above. Each subsection presents visuals, descriptive statistics, and a summarized insight.
Question 1 — What types of content drive the most engagement?
Business Context:
Understanding which content categories perform best helps editorial teams prioritize topics that consistently attract higher engagement.
This analysis examines which Mashable sections (e.g., Lifestyle, Business, Technology, etc.) receive the most shares.
Approach:
Group articles by channel (e.g., Lifestyle, Business, Tech, etc.)
Compare their median and distribution of shares
Visualize with bar and box plots to identify engagement patterns
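The grouping step can be sketched on toy data; the dataset encodes each article's channel as one-hot `data_channel_is_*` flags, with all-zero rows treated as Uncategorized:

```python
import pandas as pd

# Toy rows standing in for the real one-hot channel flags.
df = pd.DataFrame({
    "data_channel_is_tech": [1, 0, 0, 0],
    "data_channel_is_bus":  [0, 1, 1, 0],
    "shares": [2000, 800, 1200, 500],
})

flags = [c for c in df.columns if c.startswith("data_channel_is_")]
# idxmax picks the active flag; rows with no flag become "uncategorized".
df["channel"] = df[flags].idxmax(axis=1).str.replace("data_channel_is_", "")
df.loc[df[flags].sum(axis=1) == 0, "channel"] = "uncategorized"

print(df.groupby("channel")["shares"].median().sort_values(ascending=False))
```

The median is used rather than the mean because share counts are heavily right-skewed, so a few viral articles would otherwise dominate each channel's average.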
Insight Summary:
We found that Social Media articles drive the highest engagement, with a median of ~2,100 shares and strong viral upside. Uncategorized articles come next, with a median of about 1,900 shares. Lifestyle and Technology follow closely, offering reliable performance. Business and World articles generate lower engagement, although their performance is more stable. This suggests that editorial resources should be weighted toward Social Media, Uncategorized, Lifestyle, and Technology content to maximize reach.
Question 2 — Does article length affect popularity?
Business Context:
Article length can influence how readers engage with content. Shorter articles may appeal to readers who want a quick read, while longer articles can deliver more depth and detail.
This analysis will attempt to determine whether title length and content length (measured by estimated reading time) influence how often people share an article.
Approach:
Analyze the distribution of n_tokens_title and n_tokens_content to understand writing patterns.
Convert content length into estimated reading time, assuming an average reading speed of 200 words per minute.
Group articles into title length and reading time bins, then calculate median shares within each group to detect engagement trends.
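The reading-time binning step can be sketched on toy values, assuming the 200-words-per-minute reading speed stated above:

```python
import pandas as pd

# Toy word counts and shares (hypothetical values, not from the dataset).
df = pd.DataFrame({"n_tokens_content": [300, 900, 2500, 4500],
                   "shares": [1000, 1400, 1800, 2600]})

df["read_time_min"] = df["n_tokens_content"] / 200  # 200 wpm assumption
df["read_bin"] = pd.cut(df["read_time_min"],
                        bins=[0, 2, 5, 10, 20, 60],
                        labels=["0–2", "3–5", "6–10", "11–20", "20+"])

# Median shares per reading-time bin reveal the engagement trend.
print(df.groupby("read_bin", observed=True)["shares"].median())
```

The same `pd.cut` pattern applies to `n_tokens_title` for the title-length bins.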
Insight Summary:
The first chart shows that title lengths follow a roughly normal distribution centered on ten words, indicating that most editors in the dataset write titles of about that length. Content length, by contrast, is highly right-skewed: most articles run under 2,000 words, and only a small fraction extends beyond that.
Comparing the two engagement bar charts, the Median Shares by Title Length chart shows that titles with sixteen or more words generated the highest median shares, suggesting that longer titles attract readers’ attention and encourage sharing. Similarly, the Median Shares by Content Length (Reading Time) chart shows that articles taking over twenty minutes to read achieved the highest median share counts. This suggests that readers are more likely to share longer, more detailed articles, possibly because such pieces convey unique insights. However, since long articles are relatively rare, the results likely reflect the higher quality and editorial investment in these pieces rather than length alone.
Question 3 — How do timing and recency matter?
Business Context:
In news media, timing affects visibility, competition, and user behavior. Identifying when engagement peaks can help optimize content scheduling.
Approach:
Analyze weekday_is_* and is_weekend variables
Compare median shares by weekday and weekend
Evaluate timedelta (days since publication) to understand recency impact
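The weekday/weekend comparison can be sketched on toy data; `is_weekend` is a 0/1 flag in the dataset:

```python
import pandas as pd

# Toy rows (hypothetical share counts) illustrating the weekend split.
df = pd.DataFrame({"is_weekend": [0, 0, 0, 1, 1],
                   "shares": [1200, 1400, 1500, 1900, 2100]})

# Median shares per group; the real analysis also breaks this out
# per weekday using the weekday_is_* flags.
medians = df.groupby("is_weekend")["shares"].median()
print({"weekday": medians[0], "weekend": medians[1]})
```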
Insight Summary:
The analysis shows a clear temporal pattern in audience engagement. Articles published on weekends significantly outperform those released during the week, with Saturday reaching the highest median share count (around 2,000) and Sunday following closely behind.
Looking at the Median Shares: Weekday vs. Weekend, we can see that weekday articles average between 1,300 and 1,500 median shares, suggesting lower reader interaction compared to the weekend, where articles average between 1,800 and 2,000. This pattern implies that readers are more likely to discover and share content during the weekend compared to traditional working hours.
For editorial teams, these findings suggest that scheduling major stories and feature articles for weekend publication could improve shareability and overall reach.
Question 4 — What role do visuals and links play?
Business Context:
Different visuals and hyperlinks play a key role in how readers engage with online content. For example, images and videos can enhance storytelling and emotional connections, while hyperlinks can improve credibility, depth and SEO performance.
Approach:
Examine the distributions of num_imgs, num_videos, num_hrefs, and num_self_hrefs.
Compare median shares across binned groups (e.g., 0, 1–2, 3–5, 6–10, 10+).
Use Spearman correlation to measure non-linear relationships between these variables and shares.
Identify whether richer multimedia or higher hyperlink counts consistently align with stronger engagement.
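The Spearman step can be sketched on toy counts (hypothetical values); rank correlation captures monotonic, possibly non-linear, association, which suits skewed share counts better than Pearson correlation:

```python
import pandas as pd

# Toy image counts and shares illustrating a monotonic relationship.
df = pd.DataFrame({"num_imgs": [0, 1, 3, 8, 15],
                   "shares": [900, 1100, 1300, 2200, 2500]})

# Spearman correlates the *ranks* of the two variables, so it is
# insensitive to the exact (skewed) scale of shares.
rho = df["num_imgs"].corr(df["shares"], method="spearman")
print(round(rho, 3))
```

Because the toy values rise together throughout, the rank correlation here is exactly 1.0; real data yields weaker, noisier values.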
Insight Summary:
The analysis shows that multimedia and linking are associated with higher shareability. The Median Shares by # Images chart shows that articles with 10 or more images, or with roughly 20 to 40 hyperlinks, achieve higher median shares, suggesting that visuals and reference links increase readers’ willingness to share.
Content quality and contextual relevance of visuals and links matter more than their sheer quantity, guiding editors to favor well-curated, purposeful multimedia integration over volume.
Question 5 — Do sentiment and keywords influence sharing?
Business Context:
This question examines how keyword choice and emotional tone relate to online engagement.
Articles that strike the right emotional balance, or align with current trends, are often associated with high-performing keywords and are more likely to attract attention and sharing.
This analysis explores how sentiment polarity, subjectivity, and keyword strength contribute to article popularity.
Approach:
Evaluate sentiment variables (global_sentiment_polarity, title_sentiment_polarity, global_subjectivity, title_subjectivity) to measure tone and emotional intensity.
Examine keyword-based metrics (kw_min_avg, kw_max_avg, kw_avg_avg) that represent how popular or widely shared those keywords are across the entire dataset.
Compare median shares across binned sentiment and keyword quartiles to identify which emotional tones and topic strengths are most associated with engagement.
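The keyword-quartile comparison can be sketched on toy values (hypothetical, not from the dataset); `pd.qcut` splits `kw_avg_avg` into equal-sized quartiles, and median shares are then compared per quartile:

```python
import pandas as pd

# Toy keyword-popularity scores and shares for eight articles.
df = pd.DataFrame({"kw_avg_avg": [1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500],
                   "shares":     [800,  900, 1100, 1200, 1500, 1600, 2200, 2400]})

# qcut assigns each article to a quartile of keyword popularity.
df["kw_q"] = pd.qcut(df["kw_avg_avg"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

print(df.groupby("kw_q", observed=True)["shares"].median())
```

The same pattern applies to the sentiment variables, with `pd.cut` over fixed polarity ranges instead of quantiles.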
| Metric | count | mean | std | min | 25% | 50% | 75% | max |
|--------|-------|------|-----|-----|-----|-----|-----|-----|
| kw_min_avg | 39,644 | 1,117.15 | 1,137.46 | -1.0 | 0.00 | 1,023.64 | 2,056.78 | 3,613.04 |
| kw_max_avg | 39,644 | 5,657.21 | 6,098.87 | 0.0 | 3,562.10 | 4,355.69 | 6,019.95 | 298,400.00 |
| kw_avg_avg | 39,644 | 3,135.86 | 1,318.15 | 0.0 | 2,382.45 | 2,870.07 | 3,600.23 | 43,567.66 |
Spearman correlation between kw_min_avg and shares: 0.103
Spearman correlation between kw_max_avg and shares: 0.223
Spearman correlation between kw_avg_avg and shares: 0.256
Insight Summary:
The analysis confirms that both emotional tone and keyword popularity have meaningful effects on shareability.
Articles with positive sentiment polarity, especially those written in optimistic or emotionally expressive language, receive more shares. Similarly, subjective writing, where authors present opinions or emotional perspectives, also correlates with higher engagement, likely because such content feels more relatable and authentic to readers.
When examining keyword metrics, two variables stood out as the most influential:
kw_avg_avg — representing the average popularity of all keywords in an article, reflecting overall topic appeal.
kw_max_avg — capturing the strongest or trendiest keyword within an article, showing viral potential tied to a single trending topic.
Articles that scored high on both measures saw significantly more shares, demonstrating that using popular or high-traffic topics boosts visibility.
Since these variables showed the strongest and most stable correlations with article shares, I selected them as core predictive features for the modeling phase in Question 6.
In summary, positive tone, expressive style, and keyword relevance are key elements of viral content.
This finding bridges the exploratory analysis to predictive modeling, providing a data-driven rationale for which factors best forecast article success.
Question 6 — Can we predict article popularity before publication?
Business Context:
If editors could estimate an article’s popularity before publishing, they could strategically allocate resources: prioritizing high-impact stories, optimizing headlines, and tailoring content strategy around engagement potential.
To analyze this problem, I reframed it as a predictive analytics task, testing whether metadata and pre-publication features (such as sentiment, keywords, and structure) can meaningfully forecast share counts.
Approach:
Model 1: Regression — Predict the continuous number of shares using Random Forest Regressor.
Model 2: Classification — Label each article as viral or non-viral using the median shares threshold, then predict the class with a Random Forest Classifier.
Evaluate model performance using key metrics:
Regression: R², MAE, RMSE
Classification: Accuracy, F1-score, and ROC-AUC
Analyze feature importance to identify which article attributes most influence shareability.
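The classification setup can be sketched end-to-end on synthetic data; the random features below merely stand in for real pre-publication columns such as `kw_avg_avg` and sentiment polarity, so the metrics printed here do not reproduce the report's results:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))                               # stand-in feature matrix
shares = np.exp(X[:, 0] + rng.normal(scale=0.5, size=500))  # skewed share-like target
y = (shares > np.median(shares)).astype(int)                # viral = above-median shares

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Accuracy and ROC-AUC, plus the index of the most important feature.
acc = accuracy_score(y_te, clf.predict(X_te))
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(round(acc, 2), round(auc, 2), clf.feature_importances_.argmax())
```

Binarizing at the median sidesteps the extreme skew that makes exact share counts so hard to regress, which is why the classifier in the report fares better than the regressor.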
**Regression (Random Forest Regressor):**

| Metric | Value |
|--------|-------|
| R² | 0.019 |
| MAE | 3,030.389 |
| RMSE | 10,691.588 |

**Classification (Random Forest Classifier):**

| Metric | Value |
|--------|-------|
| Accuracy | 0.642 |
| F1 Score | 0.634 |
| ROC-AUC | 0.686 |

**Confusion Matrix:**

| | Predicted: Non-Viral | Predicted: Viral |
|---|---|---|
| Actual: Non-Viral | 3,291 | 1,730 |
| Actual: Viral | 1,815 | 3,075 |
Insight Summary:
The regression model achieved an R² of 0.019, with an MAE around 3,030 and an RMSE near 10,692, showing that predicting exact share counts is highly uncertain. Audience engagement remains influenced by external and unpredictable factors (like news cycles and platform dynamics).
However, the classification model performed more reliably, reaching ~64% accuracy, F1 = 0.63, and ROC-AUC = 0.69, meaning it can correctly distinguish “viral” from “non-viral” articles more often than random chance. The confusion matrix shows balanced results between the two classes, confirming that the model generalizes moderately well.
Feature importance analysis highlights that keyword popularity metrics (kw_avg_avg, kw_max_avg), content length, and sentiment polarity are the most predictive variables. These features together capture both what the article is about and how it communicates — two key drivers of audience engagement.
Overall, while predicting the exact number of shares remains difficult, predictive classification offers valuable editorial guidance. Editors can use such models to assess potential virality before publication, enabling data-driven decisions in content planning, promotion timing, and SEO optimization.
Key Predictive Features Identified
| Rank | Feature | Interpretation |
|------|---------|----------------|
| 1 | `kw_avg_avg` | Average popularity of used keywords; high values indicate topics readers already engage with. |
| 2 | `kw_max_avg` | Strongest keyword appeal; articles with trending keywords attract more attention. |
| 3 | `n_tokens_content` | Longer, in-depth articles tend to get shared more. |
| 4 | `global_sentiment_polarity` | Positive tone enhances reader resonance. |
| 5 | `num_hrefs` | Moderate external linking boosts credibility and engagement. |
Conclusion:
Even simple machine learning models provide useful guidance: by identifying what drives popularity, news organizations can strengthen their content before publishing.
Source Code
---title: "Online News Popularity Analysis — Predicting Article Shares Before Publication"author: "Christopher Legarda"date: todayjupyter: python3# Page + themeformat: html: theme: cosmo toc: true toc-location: left toc-depth: 3 number-sections: false code-fold: true code-summary: "Show code" code-tools: true df-print: paged smooth-scroll: true anchor-sections: true fig-width: 8 fig-height: 5 fig-align: center tbl-cap-location: top fig-cap-location: bottom # Uncomment if/when you want PDF or Word: # pdf: # documentclass: scrreprt # toc: true # number-sections: true # docx: # toc: true# Execution controlsexecute: echo: false include: false warning: false message: false cache: true freeze: auto # re-run only when code changes# Nice title bannertitle-block-banner: truepage-layout: full# (Optional) params for quick filtering, etc.# params:# year_min: 2020# year_max: 2024---```{python}import pandas as pdimport numpy as npimport matplotlib.pyplot as pltimport seaborn as snspd.set_option("display.max_columns", 61)pd.set_option("display.width", 120)sns.set(style="whitegrid", palette="muted")``````{python}file_path =r"C:\Users\Christopher\Documents\Python Projects\Online_News_Popularity\OnlineNewsPopularity.csv"df = pd.read_csv(file_path)print(f"Shape of the dataset is {df.shape}")df.head(10)```<!-- ## **Step 2: Inspect dataset structure** -->```{python}df.info()``````{python}df.describe().T.head(10)```<!-- ### **Step 3: Targer variable (`shares`)** -->```{python}df.columns = df.columns.str.strip()df['shares'].describe()plt.figure(figsize=(9,5))sns.histplot(df["shares"], bins=100, kde=False)plt.title("Distribution of Shares")plt.xlabel("Shares")plt.ylabel("Count")plt.show()# Log-transform for skewplt.figure(figsize=(9,5))sns.histplot(np.log1p(df["shares"]), bins=100, kde=True)plt.title("Distribution of log(1 + Shares)")plt.xlabel("log(1 + Shares)")plt.ylabel("Count")plt.show()```<!-- Raw shares are very skewed, with most articles under a few thousand and a small number way 
higher. On a linear scale, the few viral articles stretch the axis, so the histogram collapses into a few bars. A log-transform rescales shares so the distribution is more balanced and easier to compare. That’s why editorial analytics almost always works with log-shares rather than raw shares. -->```{python}plt.figure(figsize=(12,6))sns.histplot(np.log1p(df["shares"]), bins=100, kde=True)# Better labelsplt.title("Distribution of Article Shares (log-scaled)")plt.xlabel("Number of Shares")plt.ylabel("Number of Articles")# Replace log ticks with raw share numbersticks = [4, 6, 8, 10, 12] # log valueslabels = [f"{int(np.expm1(tick)):,}"for tick in ticks] # back-transform to raw sharesplt.xticks(ticks, labels)plt.show()``````{python}plt.figure(figsize=(12,6))sns.histplot(np.log1p(df["shares"]), bins=100, kde=True)plt.title("Distribution of Article Shares (log-scaled)")plt.xlabel("Number of Shares")plt.ylabel("Number of Articles")# Choose more tick positions (log values)ticks = np.arange(2, 14, 1) # from 4 to 12, step = 1labels = [f"{int(np.expm1(tick)):,}"for tick in ticks] # back-transformplt.xticks(ticks, labels, rotation=45) # rotate to avoid overlapplt.show()``````{python}import numpy as npdef articles_near_share(df, share_value, bins=100):""" Given a share count, return its log1p value and the number of articles in the same histogram bin. 
""" log_val = np.log1p(share_value)# Create histogram bins on log scale counts, bin_edges = np.histogram(np.log1p(df["shares"]), bins=bins)# Find which bin the log_val falls into bin_idx = np.digitize(log_val, bin_edges) -1# Guard against edge casesif bin_idx <0or bin_idx >=len(counts):return log_val, 0, (None, None)# Bin range (low, high) for reference bin_range = (bin_edges[bin_idx], bin_edges[bin_idx+1])return log_val, counts[bin_idx], bin_range``````{python}log_val, count, bin_range = articles_near_share(df, 2000, bins=100)print(f"log1p(2000) = {log_val:.2f}")print(f"Articles in this bin: {count}")print(f"Bin covers log range {bin_range[0]:.2f} – {bin_range[1]:.2f}")```<!-- ### **Step 4: Missing values + duplicates** -->```{python}# Temporarily disable truncationpd.set_option("display.max_rows", None)missing_values = df.isna().sum().sort_values(ascending=False)print(missing_values)# Reset to default (so later outputs don’t spam your screen)pd.reset_option("display.max_rows")``````{python}if"url"in df.columns:print("Duplicate URLs:", df["url"].duplicated().sum())``````{python}df.duplicated().sum() # full row duplicates```# 1. SummaryThis project analyzes the *Online News Popularity* dataset from Kaggle to create a business-focused report.I selected this specific dataset to show my data analysis skills towards more business orientated reporting, oppose to community impact/social impact reporting, where this report focuses on data-driven insights to support business decisions.This dataset has over 39,000 online articles; features such as content structure, keyword performance, sentiment, and the engagement score (measured via social media shares) characterize each article.In this report, I’ll try to convert raw engagement data into actionable business insights by integrating exploratory visualization, feature analysis, and predictive modeling, which could help digital media organizations improve their content planning, publishing, and positioning.# 2. 
Business Problem & Analytical QuestionsIn digital media, article engagement is the main metric of its success. Understanding why certain articles “go viral” while others do not can help organizations make informed decisions and plan advertising strategies and editorial resource allocation.This project aims to explore the factors that influence online engagement and provide insights into how publishers can increase shareability, visibility, and reader retention.The report seeks to identify data-driven strategies for improving content performance and to evaluate the possibility of predicting an article’s virality before its publication.To achieve this, the analysis is guided by six key business questions:| # | Business Question | Focus Area ||---|--------------------|-------------|| **1.** | What types of content drive the most engagement? | Topic & category performance || **2.** | Does article length affect popularity? | Content structure & readability || **3.** | How do timing and recency matter? | Publishing schedule optimization || **4.** | What role do visuals and links play? | Multimedia & SEO strategy || **5.** | Do sentiment and keywords influence sharing? | Tone, emotion, and keyword strength || **6.** | Can we predict article popularity before publication? | Predictive modeling for editorial planning |Together, these questions form a **data-to-decision framework** that mirrors the real-world analytics process — from raw data exploration to business insights and predictive forecasting.# 3. 
Data Overview & Methodology## 3.1 Data Source & DescriptionThe dataset used in this analysis is the **Online News Popularity** dataset, originally published by **Mashable** and made available through the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity) and on [Kaggle](https://www.kaggle.com/datasets/thehapyone/uci-online-news-popularity-data-set).In this downloadable dataset, it contains 39,644 rows (online news articles), collected between January 2013 and December 2014, where each labeled with metadata, content features, and social media engagement metrics (measured by the total number of shares on platforms like Facebook, LinkedIn, and Google+).Each row in the dataset represents a single article, and each column captures an attribute describing aspects of its content, publication timing, sentiment, and keyword performance.This dataset provides an opportunity to explore business focus data analysis by identifying engagement patterns to inform editorial strategy, SEO optimization, and content marketing decisions.## 3.2 Key Feature CategoriesThe table below shows features that are grouped into meaningful categories that reflect editorial and marketing considerations, making the data more interpretable for business stakeholders.| Feature Category | Example Columns | Business Interpretation ||------------------|-----------------|--------------------------|| **Content Length** |`n_tokens_title`, `n_tokens_content`| Title and article word counts — indicate readability and information depth. || **Multimedia Richness** |`num_imgs`, `num_videos`| Number of images and videos — measure visual engagement level. || **SEO & Linking Strategy** |`num_hrefs`, `num_self_hrefs`| Counts of external and internal hyperlinks — reflect how optimized an article is for search and cross-navigation. 
|| **Keyword Performance** |`kw_min_avg`, `kw_max_avg`, `kw_avg_avg`| Aggregated keyword popularity metrics — proxy for topic relevance and trend appeal. || **Sentiment & Tone** |`global_sentiment_polarity`, `global_subjectivity`, `title_sentiment_polarity`, `title_subjectivity`| Emotional tone and objectivity of both the article and its title. || **Timing Features** |`weekday_is_*`, `is_weekend`, `timedelta`| Publication day and recency — reveal engagement patterns over time. || **Engagement Outcome** |`shares`| The number of times an article was shared — the main measure of popularity and performance. |## 3.3 Analytical ApproachThis report uses a set method to go from finding patterns to providing useful information. Below is a list of analytical techniques that are commonly found in business analytics that I’ve implemented in this report:1. Descriptive Analytics–Examine historical data to identify how content type, sentiment, visuals, and timing relate to engagement.2. Diagnostic Analytics–Explore why certain articles perform better using correlation analysis, segmentation, and visual storytelling.3. Predictive Analytics–Build a simple model to evaluate whether pre-publication article features can predict shareability or “virality.”I conducted the analysis using Python on Jupyter Notebook, including libraries like ‘pandas’ and ‘numpy’ for data manipulation, ‘matplotlib’ and ‘seaborn’ for visualization, and ‘scikit-learn’ for statistical modeling and prediction.Since this report targets non-technical and business audiences, it emphasizes the findings and insights rather than the code implementation itself.Before analysis, I’ve performed all data preparation steps, such as handling missing values, cleaning, and transformations, to ensure data accuracy and consistency.# 4. Exploratory Data Analysis (EDA & Insights)This section is the main starting point of the report, which explores through six of the business-driven analytical questions that I’ve mentioned above. 
Each subsection below presents visuals, descriptive statistics, and summarized insights.

## Question 1 — What types of content drive the most engagement?

**Business Context:** Understanding which content categories perform best helps editorial teams prioritize topics that consistently attract higher engagement. This analysis examines which Mashable sections (e.g., Lifestyle, Business, Technology) receive the most shares.

**Approach:**

- Group articles by `channel` (e.g., Lifestyle, Business, Tech)
- Compare the median and distribution of `shares` per channel
- Visualize with bar and box plots to identify engagement patterns

```{python}
# 1) Identify the 6 channel columns and coerce to numeric 0/1
channel_cols = [c for c in df.columns if c.startswith("data_channel_is_")]
df[channel_cols] = df[channel_cols].fillna(0).astype(int)

# 2) Row-wise sum to detect uncategorized or (rare) multi-labeled rows
row_sum = df[channel_cols].sum(axis=1)

# 3) Assign channel only for rows with exactly one active flag; else mark Uncategorized
channel_raw = df[channel_cols].idxmax(axis=1).str.replace("data_channel_is_", "")
channel_raw = np.where(row_sum == 0, "uncategorized", channel_raw)  # all zeros
channel_raw = np.where(row_sum > 1, "uncategorized", channel_raw)   # safety for ties

# 4) Map to presentation names
channel_mapping = {
    "lifestyle": "Lifestyle",
    "entertainment": "Entertainment",
    "bus": "Business",
    "socmed": "Social Media",
    "tech": "Technology",
    "world": "World",
    "uncategorized": "Uncategorized",
}
df["channel"] = pd.Series(channel_raw, index=df.index).str.lower().map(channel_mapping)

# 5) Sanity checks
print("Rows with all zeros:", (row_sum == 0).sum())
print("Rows with >1 channels:", (row_sum > 1).sum())
print(df["channel"].value_counts(dropna=False))
```

```{python}
df.head()
```

```{python}
# Target on log scale for robust comparisons
df["log_shares"] = np.log1p(df["shares"])

channel_summary = (
    df.groupby("channel", observed=True)["shares"]
      .agg(count="size", median="median", mean="mean")
      .sort_values("median", ascending=False)
)
channel_summary
```

```{python}
#| echo: false
#| include: true
order = channel_summary.index.tolist()

plt.figure(figsize=(10, 6))
ax = sns.barplot(data=df, x="channel", y="shares", estimator=np.median,
                 order=order, errorbar=None)

# Title and labels
ax.set_title("Typical Article Shares by Content Channel")
ax.set_xlabel("Channel")
ax.set_ylabel("Median Shares")

# Choose y-ticks (adjust range depending on your data spread)
max_val = df.groupby("channel")["shares"].median().max()
yticks = np.linspace(0, max_val, 8)  # 8 evenly spaced ticks
ax.set_yticks(yticks)
ax.set_yticklabels([f"{int(t):,}" for t in yticks])

# --- Add data labels ---
medians = df.groupby("channel")["shares"].median().reindex(order)
for i, (cat, val) in enumerate(medians.items()):
    ax.text(i, val + (0.02 * max_val), f"{int(val):,}",  # place slightly above bar
            ha="center", va="bottom", fontsize=9, fontweight="bold")

plt.tight_layout()
plt.show()
```

```{python}
plt.figure(figsize=(12, 6))
ax = sns.boxplot(
    data=df, x="channel", y="shares", order=order,
    showfliers=False  # hide extreme outliers so boxes are easier to read
)
ax.set_title("Distribution of Article Shares by Content Channel")
ax.set_xlabel("Channel")
ax.set_ylabel("Shares")

# Format y-ticks with commas
yticks = ax.get_yticks()
ax.set_yticks(yticks)
ax.set_yticklabels([f"{int(t):,}" for t in yticks])

plt.tight_layout()
plt.show()
```

```{python}
def boxplot_stats(x):
    q1 = x.quantile(0.25)
    q3 = x.quantile(0.75)
    iqr = q3 - q1
    # Tukey's fences (1.5× IQR rule)
    lower_whisker = x[x >= (q1 - 1.5 * iqr)].min()
    upper_whisker = x[x <= (q3 + 1.5 * iqr)].max()
    return pd.Series({
        "Q0 (whisker)": lower_whisker,
        "Q1 (25th)": q1,
        "Median (Q2)": x.median(),
        "Q3 (75th)": q3,
        "Q4 (whisker)": upper_whisker,
        "Min": x.min(),
        "Max": x.max(),
    })

# Apply per channel
channel_box_stats = df.groupby("channel")["shares"].apply(boxplot_stats).unstack()

# Format with commas
channel_box_stats = channel_box_stats.applymap(lambda v: f"{int(v):,}")
channel_box_stats
```

**Insight Summary:** Social Media articles drive the highest engagement, with a median of ~2,100 shares and strong viral upside. Uncategorized articles follow with a median of ~1,900, and Lifestyle and Technology come next with reliable performance. Business and World articles generate lower, though more stable, engagement. This suggests editorial resources should be weighted toward Social Media, Uncategorized, Lifestyle, and Technology content to maximize reach.

## Question 2 — Does article length affect popularity?

**Business Context:** Article length can influence how readers engage with content. Shorter articles may appeal to readers who want a quick read, while longer articles can deliver more depth and detail. This analysis examines whether title length and content length (measured by estimated reading time) influence how often people share an article.

**Approach:**

- Analyze the distributions of `n_tokens_title` and `n_tokens_content` to understand writing patterns.
- Convert content length into estimated reading time, assuming an average reading speed of 200 words per minute.
- Group articles into title-length and reading-time bins, then calculate median shares within each group to detect engagement trends.

```{python}
print("\n---Statistical description of `n_tokens_title`---\n")
df["n_tokens_title"].describe()
```

```{python}
print("\n---Statistical description of `n_tokens_content`---\n")
df["n_tokens_content"].describe()
```

```{python}
#| echo: false
#| include: true
plt.figure(figsize=(12, 6))
sns.histplot(df["n_tokens_title"], bins=30, kde=True)
plt.title("Distribution of Title Length (words)")
plt.xlabel("Title word count")
plt.ylabel("Number of articles")
plt.show()

plt.figure(figsize=(12, 6))
sns.histplot(df["n_tokens_content"], bins=50, kde=True)
plt.title("Distribution of Content Length (words)")
plt.xlabel("Content word count")
plt.ylabel("Number of articles")
plt.show()
```

```{python}
plt.figure(figsize=(12, 6))
sns.scatterplot(data=df, x="n_tokens_title", y="shares", alpha=0.3)
sns.regplot(data=df, x="n_tokens_title", y="shares", scatter=False, color="red")
plt.title("Shares vs Title Length")
plt.xlabel("Title word count")
plt.ylabel("Number of shares per article")
plt.ylim(0, 20000)
plt.show()

plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x="n_tokens_content", y="shares", alpha=0.3)
sns.regplot(data=df, x="n_tokens_content", y="shares", scatter=False, color="red")
plt.title("Shares vs Content Length")
plt.xlabel("Content word count")
plt.ylabel("Shares")
plt.ylim(0, 20000)  # cap for readability
plt.show()
```

```{python}
# Title bins (labels aligned with the bin edges)
df["title_len_bin"] = pd.cut(
    df["n_tokens_title"],
    bins=[0, 5, 10, 15, 20, 60],
    labels=["0–5", "6–10", "11–15", "16–20", "20+"]
)

# Content bins (convert to approximate read time at 200 wpm)
df["read_time_min"] = df["n_tokens_content"] / 200
df["content_len_bin"] = pd.cut(
    df["read_time_min"],
    bins=[0, 2, 5, 10, 20, 60],
    labels=["0–2 min", "3–5 min", "6–10 min", "11–20 min", "20+ min"]
)
```

```{python}
title_summary = df.groupby("title_len_bin", observed=True)["shares"].median()
content_summary = df.groupby("content_len_bin", observed=True)["shares"].median()
print("Median shares by title length bin:\n", title_summary)
print("\nMedian shares by content length bin:\n", content_summary)
```

```{python}
#| echo: false
#| include: true
plt.figure(figsize=(12, 6))
sns.barplot(x=title_summary.index, y=title_summary.values)
plt.title("Median Shares by Title Length")
plt.xlabel("Title word count range")
plt.ylabel("Median Shares")
plt.show()

plt.figure(figsize=(12, 6))
sns.barplot(x=content_summary.index, y=content_summary.values)
plt.title("Median Shares by Content Length (Reading Time)")
plt.xlabel("Estimated Reading Time")
plt.ylabel("Median Shares")
plt.show()
```

```{python}
plt.figure(figsize=(12, 6))
sns.boxplot(
    data=df, x="title_len_bin", y="shares",
    hue="channel", showfliers=False
)
plt.title("Shares by Title Length (split by Channel)")
plt.ylabel("Shares")
plt.xlabel("Title word count bin")
plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left")
plt.show()
```

**Insight Summary:** Title lengths follow a roughly normal distribution centered on about ten words, indicating that most editors in the dataset favor titles of roughly that length. Content length, by contrast, is highly right-skewed: most articles run just under 2,000 words, and only a small fraction extend beyond that. Comparing the two engagement bar charts, the longest title bin generated the highest median shares, suggesting that longer, more descriptive titles attract readers' attention and encourage sharing. Similarly, articles that take over twenty minutes to read achieved the highest median share counts.
This suggests that readers are more likely to share articles with greater depth and detail, possibly because these pieces convey unique insights. However, since such long articles are relatively rare, the results likely reflect the higher quality and editorial investment in these pieces rather than length alone.

## Question 3 — How do timing and recency matter?

**Business Context:** In news media, timing affects visibility, competition, and user behavior. Identifying when engagement peaks can help optimize content scheduling.

**Approach:**

- Analyze the `weekday_is_*` and `is_weekend` variables
- Compare median shares by weekday and weekend
- Evaluate `timedelta` (days since publication) to understand recency impact

```{python}
print("\n---Statistical Summary for `timedelta`---\n")
df["timedelta"].describe()
```

```{python}
from statsmodels.nonparametric.smoothers_lowess import lowess

plt.figure(figsize=(10, 5))
sns.scatterplot(data=df, x="timedelta", y="shares", alpha=0.3)

lowess_smoothed = lowess(df["shares"], df["timedelta"], frac=0.05)
plt.plot(lowess_smoothed[:, 0], lowess_smoothed[:, 1], color="red", linewidth=2)

plt.title("Shares vs Article Recency (timedelta)")
plt.xlabel("Days since publication (timedelta)")
plt.ylabel("Shares")
plt.ylim(0, 20000)
plt.show()
```

```{python}
weekday_cols = [c for c in df.columns if c.startswith("weekday_is_")]
weekday_map = {
    "weekday_is_monday": "Monday",
    "weekday_is_tuesday": "Tuesday",
    "weekday_is_wednesday": "Wednesday",
    "weekday_is_thursday": "Thursday",
    "weekday_is_friday": "Friday",
    "weekday_is_saturday": "Saturday",
    "weekday_is_sunday": "Sunday",
}
weekday_df = df[weekday_cols]
df["weekday"] = weekday_df.idxmax(axis=1).map(weekday_map)
```

```{python}
weekday_summary = (
    df.groupby("weekday")["shares"]
      .median()
      .sort_values(ascending=False)
)
print("Median shares by weekday:\n", weekday_summary)
```

```{python}
#| echo: false
#| include: true
plt.figure(figsize=(12, 6))
sns.barplot(data=df, x="weekday", y="shares", estimator=np.median,
            order=weekday_summary.index, errorbar=None)
plt.title("Median Shares by Weekday")
plt.xlabel("Day of the Week")
plt.ylabel("Median Shares")
plt.tight_layout()
plt.show()
```

```{python}
#| echo: false
#| include: true
# Dataset already has an is_weekend column
weekend_summary = df.groupby("is_weekend")["shares"].median()
weekend_summary.index = ["Weekday", "Weekend"]

plt.figure(figsize=(10, 5))
sns.barplot(x=weekend_summary.index, y=weekend_summary.values)
plt.title("Median Shares: Weekday vs Weekend")
plt.xlabel("")
plt.ylabel("Median Shares")
plt.show()
```

**Insight Summary:** The analysis shows a clear temporal pattern in audience engagement. Articles published on weekends significantly outperform those released during the week, with Saturday reaching the highest median share count (around 2,000) and Sunday following closely behind. The weekday-vs-weekend comparison shows that weekday articles average between 1,300 and 1,500 median shares, compared to 1,800–2,000 on weekends, suggesting lower reader interaction during traditional working hours. This pattern implies that readers are more likely to discover and share content on weekends. For editorial teams, these findings suggest that scheduling major stories and feature articles for weekend publication could improve shareability and overall reach.

## Question 4 — What role do visuals and links play?

**Business Context:** Visuals and hyperlinks play a key role in how readers engage with online content. Images and videos can enhance storytelling and emotional connection, while hyperlinks can improve credibility, depth, and SEO performance.

**Approach:**

- Examine the distributions of `num_imgs`, `num_videos`, `num_hrefs`, and `num_self_hrefs`.
- Compare median shares across binned groups (e.g., 0, 1–2, 3–5, 6–10, 10+).
- Use Spearman correlation to measure monotonic (potentially non-linear) relationships between these variables and `shares`.
- Identify whether richer multimedia or higher hyperlink counts consistently align with stronger engagement.

```{python}
visual_cols = ["num_imgs", "num_videos", "num_hrefs", "num_self_hrefs"]
df[visual_cols].describe().T
```

```{python}
fig, axes = plt.subplots(2, 2, figsize=(12, 8))
for ax, col in zip(axes.flat, visual_cols):
    sns.histplot(df[col], bins=50, ax=ax)
    ax.set_title(f"Distribution of {col}")
    ax.set_xlim(0, 50)
plt.tight_layout()
plt.show()
```

```{python}
for col in visual_cols:
    corr = df[col].corr(df["shares"], method="spearman")
    print(f"Spearman correlation between {col} and shares: {corr:.3f}")
```

```{python}
# Define bins
df["img_bin"] = pd.cut(df["num_imgs"], bins=[-1, 0, 2, 5, 10, 100],
                       labels=["0", "1–2", "3–5", "6–10", "10+"])
df["video_bin"] = pd.cut(df["num_videos"], bins=[-1, 0, 1, 3, 10, 100],
                         labels=["0", "1", "2–3", "4–10", "10+"])
df["href_bin"] = pd.cut(df["num_hrefs"], bins=[-1, 5, 10, 20, 40, 1000],
                        labels=["0–5", "6–10", "11–20", "21–40", "40+"])
df["self_href_bin"] = pd.cut(df["num_self_hrefs"], bins=[-1, 0, 2, 5, 10, 100],
                             labels=["0", "1–2", "3–5", "6–10", "10+"])
```

```{python}
#| echo: false
#| include: true
# Plot median shares for each bin
fig, axes = plt.subplots(2, 2, figsize=(12, 8))

sns.barplot(data=df, x="img_bin", y="shares", estimator=np.median,
            ax=axes[0, 0], errorbar=None)
axes[0, 0].set_title("Median Shares by # Images")

sns.barplot(data=df, x="video_bin", y="shares", estimator=np.median,
            ax=axes[0, 1], errorbar=None)
axes[0, 1].set_title("Median Shares by # Videos")

sns.barplot(data=df, x="href_bin", y="shares", estimator=np.median,
            ax=axes[1, 0], errorbar=None)
axes[1, 0].set_title("Median Shares by # Hyperlinks")

sns.barplot(data=df, x="self_href_bin", y="shares", estimator=np.median,
            ax=axes[1, 1], errorbar=None)
axes[1, 1].set_title("Median Shares by # Self-References")

for ax in axes.flat:
    ax.set_ylabel("Median Shares")
    ax.set_xlabel("")

plt.tight_layout()
plt.show()
```

**Insight Summary:** The analysis shows that multimedia use and linking are associated with higher shareability.
In the Median Shares by # Images chart, articles with 10 or more images, and those with roughly 21–40 hyperlinks, show higher median shares, suggesting that visuals and reference links increase readers' willingness to share. That said, the content quality and contextual relevance of visuals and links likely matter more than sheer quantity, so editors should favor well-curated, purposeful multimedia integration over volume.

## Question 5 — Do sentiment and keywords influence sharing?

**Business Context:** This question examines how keyword choice and the emotional tone of an article affect online engagement. Articles that strike the right emotional balance, or that align with trending, high-performing keywords, are more likely to attract attention and sharing. This analysis explores how sentiment *polarity*, *subjectivity*, and keyword strength contribute to article popularity.

**Approach:**

- Evaluate sentiment variables (`global_sentiment_polarity`, `title_sentiment_polarity`, `global_subjectivity`, `title_subjectivity`) to measure tone and emotional intensity.
- Examine keyword-based metrics (`kw_min_avg`, `kw_max_avg`, `kw_avg_avg`) that represent how popular or widely shared those keywords are across the entire dataset.
- Compare median shares across binned sentiment ranges and keyword quartiles to identify which emotional tones and topic strengths are most associated with engagement.

```{python}
#| echo: false
#| include: true
kw_cols = ["kw_min_avg", "kw_max_avg", "kw_avg_avg"]

# Summary stats
print(df[kw_cols].describe().T)

# Correlations with shares
for col in kw_cols:
    corr = df[col].corr(df["shares"], method="spearman")
    print(f"Spearman correlation between {col} and shares: {corr:.3f}")

# Histograms
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, kw_cols):
    sns.histplot(df[col], bins=50, ax=ax)
    ax.set_title(f"Distribution of {col}")
plt.tight_layout()
plt.show()
```

```{python}
#| echo: false
#| include: true
for col in kw_cols:
    df[f"{col}_bin"] = pd.qcut(df[col], 4, labels=["Q1 (low)", "Q2", "Q3", "Q4 (high)"])
    plt.figure(figsize=(8, 4))
    sns.barplot(data=df, x=f"{col}_bin", y="shares", estimator=np.median, errorbar=None)
    plt.title(f"Median Shares by {col} Quartile")
    plt.ylabel("Median Shares")
    plt.xlabel(col)
    plt.show()
```

```{python}
#| echo: false
#| include: true
# Polarity: [-1, 1], Subjectivity: [0, 1]
sentiment_cols = [
    "global_sentiment_polarity",
    "global_subjectivity",
    "title_sentiment_polarity",
    "title_subjectivity",
]

for col in sentiment_cols:
    if "subjectivity" in col:
        bins = [0.0, 0.25, 0.50, 0.75, 1.0]
        labels = ["0–0.25", "0.25–0.5", "0.5–0.75", "0.75–1.0"]
        s = df[col].clip(0, 1)  # safety
        xlabel = f"{col} (quartile-like fixed bins)"
    else:
        bins = [-1.0, -0.25, 0.0, 0.25, 1.0]
        labels = ["Very neg (≤-0.25)", "Slight neg (-0.25–0)",
                  "Slight pos (0–0.25)", "Very pos (≥0.25)"]
        s = df[col].clip(-1, 1)  # safety
        xlabel = f"{col} (fixed polarity bins)"

    df[f"{col}_bin"] = pd.cut(s, bins=bins, labels=labels, include_lowest=True)

    plt.figure(figsize=(8, 4))
    sns.barplot(data=df, x=f"{col}_bin", y="shares", estimator=np.median, errorbar=None)
    plt.title(f"Median Shares by {col} bin")
    plt.ylabel("Median Shares")
    plt.xlabel(xlabel)
    plt.tight_layout()
    plt.show()
```

**Insight Summary:** The analysis confirms that both emotional tone and keyword popularity have meaningful effects on shareability. Articles with positive sentiment polarity, especially those written in optimistic or emotionally expressive language, receive more shares. Similarly, subjective writing, where authors present opinions or emotional perspectives, also correlates with higher engagement, likely because such content feels more relatable and authentic to readers.

When examining keyword metrics, two variables stood out as the most influential:

- `kw_avg_avg` — the average popularity of all keywords in an article, reflecting overall topic appeal.
- `kw_max_avg` — the strongest or trendiest keyword within an article, capturing viral potential tied to a single trending topic.

Articles that scored high on both measures saw significantly more shares, demonstrating that covering popular or high-traffic topics boosts visibility. Since these variables showed the strongest and most stable correlations with article shares, I selected them as core predictive features for the modeling phase in Question 6.

In summary, positive tone, expressive style, and keyword relevance are key elements of viral content. This finding bridges the exploratory analysis to predictive modeling, providing a data-driven rationale for which factors best forecast article success.

## Question 6 — Can we predict article popularity before publication?

**Business Context:** If editors could estimate an article's popularity before publishing, they could strategically allocate resources by prioritizing high-impact stories, optimizing headlines, and tailoring content strategy around engagement potential. I reframed this problem as a predictive analytics task, testing whether metadata and pre-publication features (sentiment, keywords, and structure) can meaningfully forecast share counts.

**Approach:**

- Model 1: **Regression** — Predict the continuous number of `shares` with a linear regression baseline.
- Model 2: **Classification** — Label each article as *viral* or *non-viral* using the median-shares threshold, then predict the class with logistic regression.
- Evaluate model performance using key metrics:
  - **Regression:** R², MAE, RMSE
  - **Classification:** Accuracy, F1-score, and ROC-AUC
- Fit a Random Forest classifier on the same features and analyze its **feature importances** to identify which article attributes most influence shareability.

```{python}
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import (
    r2_score, mean_absolute_error, mean_squared_error,
    accuracy_score, f1_score, roc_auc_score, confusion_matrix
)
```

```{python}
features = [
    "n_tokens_title", "n_tokens_content",
    "num_imgs", "num_videos", "num_hrefs", "num_self_hrefs",
    "kw_avg_avg", "kw_max_avg",
    "global_sentiment_polarity", "global_subjectivity",
    "title_sentiment_polarity", "title_subjectivity",
    "channel", "weekday", "is_weekend",
]
target = "shares"
```

```{python}
df_model = df[features + [target]].dropna().copy()
```

```{python}
numeric_features = [
    "n_tokens_title", "n_tokens_content",
    "num_imgs", "num_videos", "num_hrefs", "num_self_hrefs",
    "kw_avg_avg", "kw_max_avg",
    "global_sentiment_polarity", "global_subjectivity",
    "title_sentiment_polarity", "title_subjectivity",
    "is_weekend",
]
categorical_features = ["channel", "weekday"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])
```

```{python}
X = df_model[features]
y = df_model["shares"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

reg_pipeline = Pipeline([
    ("prep", preprocess),
    ("model", LinearRegression()),
])
reg_pipeline.fit(X_train, y_train)
y_pred = reg_pipeline.predict(X_test)
```

```{python}
#| echo: false
#| include: true
metrics = {
    "Metric": ["R²", "MAE", "RMSE"],
    "Value": [
        r2_score(y_test, y_pred),
        mean_absolute_error(y_test, y_pred),
        np.sqrt(mean_squared_error(y_test, y_pred)),
    ],
}
pd.DataFrame(metrics).style.format({"Value": "{:,.3f}"})
```

```{python}
median_shares = df_model["shares"].median()
df_model["is_viral"] = (df_model["shares"] > median_shares).astype(int)
```

```{python}
X = df_model[features]
y = df_model["is_viral"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
```

```{python}
clf_pipeline = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])
clf_pipeline.fit(X_train, y_train)
y_pred = clf_pipeline.predict(X_test)
y_prob = clf_pipeline.predict_proba(X_test)[:, 1]
```

```{python}
#| echo: false
#| include: true
# --- Compute metrics ---
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_prob)
cm = confusion_matrix(y_test, y_pred)

# --- Create summary table ---
metrics = {
    "Metric": ["Accuracy", "F1 Score", "ROC-AUC"],
    "Value": [accuracy, f1, roc_auc],
}
metrics_df = pd.DataFrame(metrics).style.format({"Value": "{:.3f}"})
display(metrics_df)

# --- Show confusion matrix separately ---
cm_df = pd.DataFrame(
    cm,
    index=["Actual: Non-Viral", "Actual: Viral"],
    columns=["Predicted: Non-Viral", "Predicted: Viral"],
)
display(cm_df)
```

```{python}
#| echo: false
#| include: true
rf_clf = Pipeline([
    ("prep", preprocess),
    ("model", RandomForestClassifier(n_estimators=200, random_state=42)),
])
rf_clf.fit(X_train, y_train)

# Get feature importances (numeric names + one-hot encoded categorical names)
feature_names = (
    numeric_features
    + list(rf_clf.named_steps["prep"]
           .named_transformers_["cat"]
           .get_feature_names_out(categorical_features))
)
importances = rf_clf.named_steps["model"].feature_importances_

feat_imp = (
    pd.DataFrame({"feature": feature_names, "importance": importances})
      .sort_values("importance", ascending=False)
)

plt.figure(figsize=(8, 6))
sns.barplot(y="feature", x="importance", data=feat_imp.head(15))
plt.title("Top Predictive Features for Viral Articles")
plt.tight_layout()
plt.show()
```

**Insight Summary:** The regression model achieved an R² of 0.019, with an MAE around 3,030 and an RMSE near 10,692, showing that predicting exact share counts is highly uncertain. Audience engagement remains influenced by external and unpredictable factors (such as news cycles and platform dynamics).

However, the classification model performed more reliably, reaching ~64% accuracy, F1 = 0.63, and ROC-AUC = 0.69, meaning it distinguishes "viral" from "non-viral" articles considerably better than random chance. The confusion matrix shows balanced results between the two classes, confirming that the model generalizes moderately well.

Feature-importance analysis highlights that keyword popularity metrics (`kw_avg_avg`, `kw_max_avg`), content length, and sentiment polarity are the most predictive variables. Together, these features capture both what an article is about and how it communicates — two key drivers of audience engagement.

Overall, while predicting the exact number of shares remains difficult, predictive classification offers valuable editorial guidance. Editors can use such models to assess potential virality before publication, enabling data-driven decisions in content planning, promotion timing, and SEO optimization.

### Key Predictive Features Identified

| Rank | Feature | Interpretation |
|------|---------|----------------|
| 1 | `kw_avg_avg` | Average popularity of the article's keywords; high values indicate topics readers already engage with. |
| 2 | `kw_max_avg` | Strongest keyword appeal; articles with trending keywords attract more attention. |
| 3 | `n_tokens_content` | Longer, in-depth articles tend to get shared more. |
| 4 | `global_sentiment_polarity` | Positive tone enhances reader resonance. |
| 5 | `num_hrefs` | Moderate external linking boosts credibility and engagement. |

**Conclusion:** Even relatively simple machine learning models yield useful insight into what drives article popularity. By understanding these drivers, news organizations can refine headlines, topics, timing, and structure before publishing to improve an article's chances of success.
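As a closing illustration of how such a model could be deployed editorially, the sketch below mirrors the report's pipeline shape (StandardScaler + OneHotEncoder feeding a logistic regression) to score a hypothetical draft article before publication. All data here are synthetic stand-ins, and the draft's feature values are invented for illustration; in practice the fitted `clf_pipeline` and the full Question 6 feature set would be used instead.

```python
# Minimal sketch, assuming synthetic stand-in data (NOT the real dataset).
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 500

# Synthetic stand-in for the article table: two numeric features + one categorical
train = pd.DataFrame({
    "kw_avg_avg": rng.gamma(2.0, 1500.0, n),         # keyword-popularity proxy
    "n_tokens_content": rng.integers(100, 3000, n),  # article length in words
    "channel": rng.choice(["Technology", "Business", "Lifestyle"], n),
})
# Synthetic viral label loosely tied to keyword popularity, echoing the report's finding
is_viral = (
    train["kw_avg_avg"] + rng.normal(0, 1500, n) > train["kw_avg_avg"].median()
).astype(int)

# Same pipeline shape as the report: scale numerics, one-hot the category, classify
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["kw_avg_avg", "n_tokens_content"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["channel"]),
])
clf = Pipeline([("prep", preprocess), ("model", LogisticRegression(max_iter=1000))])
clf.fit(train, is_viral)

# Score a hypothetical draft before publication (values invented for illustration)
draft = pd.DataFrame([{
    "kw_avg_avg": 4500.0,
    "n_tokens_content": 1800,
    "channel": "Technology",
}])
p_viral = clf.predict_proba(draft)[0, 1]
print(f"Estimated probability of going viral: {p_viral:.2f}")
```

An editor could compare this probability against a house threshold (for example, the historical viral rate) to decide whether a draft merits extra promotion, a headline rework, or a weekend slot.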