1. Summary

This project analyzes the Online News Popularity dataset from Kaggle to create a business-focused report.

I selected this dataset to demonstrate my data analysis skills in business-oriented reporting, as opposed to community or social impact reporting. The report therefore focuses on data-driven insights that support business decisions.

The dataset contains over 39,000 online articles, each characterized by features such as content structure, keyword performance, sentiment, and an engagement score (measured by social media shares).

In this report, I convert raw engagement data into actionable business insights by combining exploratory visualization, feature analysis, and predictive modeling. These insights could help digital media organizations improve their content planning, publishing, and positioning.

2. Business Problem & Analytical Questions

In digital media, engagement is the main measure of an article’s success. Understanding why certain articles “go viral” while others do not can help organizations make informed decisions about advertising strategy and editorial resource allocation.

This project aims to explore the factors that influence online engagement and provide insights into how publishers can increase shareability, visibility, and reader retention.

The report seeks to identify data-driven strategies for improving content performance and to evaluate the possibility of predicting an article’s virality before its publication.

To achieve this, the analysis is guided by six key business questions:

#	Business Question	Focus Area
1.	What types of content drive the most engagement?	Topic & category performance
2.	Does article length affect popularity?	Content structure & readability
3.	How do timing and recency matter?	Publishing schedule optimization
4.	What role do visuals and links play?	Multimedia & SEO strategy
5.	Do sentiment and keywords influence sharing?	Tone, emotion, and keyword strength
6.	Can we predict article popularity before publication?	Predictive modeling for editorial planning

Together, these questions form a data-to-decision framework that mirrors the real-world analytics process — from raw data exploration to business insights and predictive forecasting.

3. Data Overview & Methodology

3.1 Data Source & Description

The dataset used in this analysis is the Online News Popularity dataset, which describes articles published by Mashable and is available through the UCI Machine Learning Repository and on Kaggle.

The dataset contains 39,644 rows (online news articles) collected between January 2013 and December 2014. Each article is labeled with metadata, content features, and a social media engagement metric (the total number of shares on platforms such as Facebook, LinkedIn, and Google+).

Each row in the dataset represents a single article, and each column captures an attribute describing aspects of its content, publication timing, sentiment, and keyword performance.

This dataset provides an opportunity for business-focused data analysis: identifying engagement patterns that can inform editorial strategy, SEO, and content marketing decisions.

3.2 Key Feature Categories

The table below groups the features into categories that reflect editorial and marketing considerations, making the data easier for business stakeholders to interpret.

Feature Category	Example Columns	Business Interpretation
Content Length	`n_tokens_title`, `n_tokens_content`	Title and article word counts — indicate readability and information depth.
Multimedia Richness	`num_imgs`, `num_videos`	Number of images and videos — measure visual engagement level.
SEO & Linking Strategy	`num_hrefs`, `num_self_hrefs`	Counts of all hyperlinks and of links to other Mashable articles; reflect how optimized an article is for search and cross-navigation.
Keyword Performance	`kw_min_avg`, `kw_max_avg`, `kw_avg_avg`	Aggregated keyword popularity metrics — proxy for topic relevance and trend appeal.
Sentiment & Tone	`global_sentiment_polarity`, `global_subjectivity`, `title_sentiment_polarity`, `title_subjectivity`	Emotional tone and objectivity of both the article and its title.
Timing Features	`weekday_is_*`, `is_weekend`, `timedelta`	Publication day and recency — reveal engagement patterns over time.
Engagement Outcome	`shares`	The number of times an article was shared — the main measure of popularity and performance.

3.3 Analytical Approach

This report follows a structured path from finding patterns to delivering actionable insight. It applies three analytical techniques that are common in business analytics:

Descriptive Analytics: Examine historical data to identify how content type, sentiment, visuals, and timing relate to engagement.
Diagnostic Analytics: Explore why certain articles perform better using correlation analysis, segmentation, and visual storytelling.
Predictive Analytics: Build a simple model to evaluate whether pre-publication article features can predict shareability or “virality.”

I conducted the analysis in Python in a Jupyter Notebook, using `pandas` and `numpy` for data manipulation, `matplotlib` and `seaborn` for visualization, and `scikit-learn` for statistical modeling and prediction.

Since this report targets non-technical and business audiences, it emphasizes the findings and insights rather than the code implementation itself.

Before the analysis, I completed the data preparation steps, such as handling missing values, cleaning, and transformations, to ensure data accuracy and consistency.
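The preparation checks mentioned above can be sketched on a tiny stand-in table. The three-row frame, the `prepare` helper, and the choice to drop fully duplicated rows are assumptions made for illustration; the report's own notebook strips whitespace from the column names and counts missing values and duplicates.

```python
import pandas as pd

# Hypothetical three-row stand-in for the real CSV; note the padded headers.
raw = pd.DataFrame({
    "url": ["a", "b", "b"],
    " n_tokens_title": [9.0, 12.0, 12.0],
    " shares": [593, 1200, 1200],
})


def prepare(frame: pd.DataFrame) -> pd.DataFrame:
    """Strip whitespace from column names, then drop fully duplicated rows."""
    out = frame.copy()
    out.columns = out.columns.str.strip()
    return out.drop_duplicates().reset_index(drop=True)


clean = prepare(raw)
missing_per_column = clean.isna().sum()          # missing values per column
duplicate_urls = int(clean["url"].duplicated().sum())  # repeated URLs left over
```

Stripping the headers first matters because a padded name such as `" shares"` would otherwise make `clean["shares"]` raise a `KeyError`.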

4. Exploratory Data Analysis (EDA & Insights)

This section is the core of the report. It works through the six business-driven analytical questions listed above, and each subsection presents visuals, descriptive statistics, and summarized insights.

Question 1 — What types of content drive the most engagement?

Business Context:

Understanding which content categories perform best helps editorial teams prioritize topics that consistently attract higher engagement.

This analysis examines which Mashable sections (e.g., Lifestyle, Business, Technology) receive the most shares.

Approach:

Group articles by channel (e.g., Lifestyle, Business, Tech)
Compare their median and distribution of shares
Visualize with bar and box plots to identify engagement patterns
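The grouping step above can be sketched as follows. The five-row frame and its share counts are invented for illustration; only the `data_channel_is_*` column naming comes from the dataset. The sketch collapses the one-hot channel flags into a single label and then compares median shares.

```python
import pandas as pd

# Invented example rows; the fourth row has no channel flag set.
demo = pd.DataFrame({
    "data_channel_is_socmed": [1, 0, 0, 0, 1],
    "data_channel_is_tech":   [0, 1, 1, 0, 0],
    "shares": [2100, 1700, 1900, 1900, 2500],
})

channel_cols = [c for c in demo.columns if c.startswith("data_channel_is_")]
active = demo[channel_cols].sum(axis=1)  # number of flags set per row

# idxmax returns the first column holding the row maximum, so an all-zero row
# would silently get the first channel; the where() guard relabels those rows.
label = demo[channel_cols].idxmax(axis=1).str.replace(
    "data_channel_is_", "", regex=False
)
demo["channel"] = label.where(active == 1, "uncategorized")

median_by_channel = (
    demo.groupby("channel")["shares"].median().sort_values(ascending=False)
)
```

The median is used rather than the mean because share counts are heavily right-skewed, so a handful of viral articles would otherwise dominate each channel's average.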

Insight Summary:

Social Media articles drive the highest engagement, with a median of ~2,100 shares and strong viral upside. Uncategorized articles come next, with a median of about 1,900 shares. Lifestyle and Technology follow closely and offer reliable performance. Business and World articles generate lower, though more stable, engagement. To maximize reach, editorial resources should therefore be weighted toward Social Media, Uncategorized, Lifestyle, and Technology content.

Question 2 — Does article length affect popularity?

Business Context:
Article length can influence how readers engage with content. Shorter articles may appeal to readers who want a quick read, while longer articles can deliver more depth and detail.

This analysis tests whether title length and content length (expressed as estimated reading time) are related to how often an article is shared.

Approach:

Analyze the distribution of n_tokens_title and n_tokens_content to understand writing patterns.
Convert content length into estimated reading time, assuming an average reading speed of 200 words per minute.
Group articles into title length and reading time bins, then calculate median shares within each group to detect engagement trends.
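As a rough illustration of the reading-time step above, the sketch below converts word counts to minutes at 200 words per minute and takes the median shares per bin. The five rows and the bin labels are assumptions made for the example; they are not the report's data.

```python
import pandas as pd

WORDS_PER_MINUTE = 200  # reading speed assumed in the report

# Invented word counts and share counts.
demo = pd.DataFrame({
    "n_tokens_content": [300, 900, 1500, 5000, 250],
    "shares": [1000, 1400, 1600, 3000, 1200],
})
demo["read_time_min"] = demo["n_tokens_content"] / WORDS_PER_MINUTE

# pd.cut uses right-closed intervals: (0, 2], (2, 5], (5, 10], (10, 20], (20, 60].
edges = [0, 2, 5, 10, 20, 60]
labels = ["0-2 min", "2-5 min", "5-10 min", "10-20 min", "20+ min"]
demo["read_time_bin"] = pd.cut(demo["read_time_min"], bins=edges, labels=labels)

# observed=True drops bins that contain no articles.
median_by_bin = demo.groupby("read_time_bin", observed=True)["shares"].median()
```

Because the intervals are open on the left, an article with zero words falls outside every bin and is left out of the grouped medians, and the labels should be written to match the edges so that a chart label describes the articles it actually contains.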

Insight Summary:

The first chart shows that title lengths follow a roughly normal distribution centered on ten words, which suggests that most editors in the dataset keep titles close to that length. Content length, by contrast, is strongly right-skewed: most articles are relatively short, under 2,000 words, and only a tiny fraction run longer.

The two engagement bar charts point in the same direction. In Median Shares by Title Length, titles with sixteen or more words have the highest median shares, suggesting that longer titles attract readers’ attention and encourage sharing. Similarly, in Median Shares by Content Length (Reading Time), articles that take over twenty minutes to read have the highest median share counts. Readers may be more likely to share longer, more detailed articles, possibly because those articles convey unique insights. However, because such long articles are relatively rare, the results may reflect the higher quality and editorial investment in these pieces rather than length alone.

Question 3 — How do timing and recency matter?

Business Context:

In news media, timing affects visibility, competition, and user behavior. Identifying when engagement peaks can help optimize content scheduling.

Approach:

Analyze weekday_is_* and is_weekend variables
Compare median shares by weekday and weekend
Evaluate timedelta (days since publication) to understand recency impact
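The weekday comparison above can be sketched like this. The four rows are invented, and although the dataset already ships an `is_weekend` column, the sketch derives a `day_type` label itself (a hypothetical name) so that it stays self-contained.

```python
import pandas as pd

# Invented rows using the dataset's one-hot weekday naming.
demo = pd.DataFrame({
    "weekday_is_monday":   [1, 0, 0, 0],
    "weekday_is_saturday": [0, 1, 0, 1],
    "weekday_is_sunday":   [0, 0, 1, 0],
    "shares": [1400, 2000, 1900, 2200],
})

day_cols = [c for c in demo.columns if c.startswith("weekday_is_")]

# Collapse the one-hot flags into a single weekday label per article.
demo["weekday"] = demo[day_cols].idxmax(axis=1).str.replace(
    "weekday_is_", "", regex=False
)
demo["day_type"] = demo["weekday"].isin(["saturday", "sunday"]).map(
    {True: "weekend", False: "weekday"}
)

median_by_day = demo.groupby("weekday")["shares"].median()
median_by_day_type = demo.groupby("day_type")["shares"].median()
```

Comparing medians per day alongside the weekday/weekend split helps show whether a weekend effect comes from both days or from one day only.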

Insight Summary:

The analysis shows a clear temporal pattern in audience engagement. Articles published on weekends outperform those released on weekdays, with Saturday reaching the highest median share count (around 2,000) and Sunday following closely behind.

In the Median Shares: Weekday vs. Weekend chart, weekday medians fall between 1,300 and 1,500 shares, while weekend medians fall between 1,800 and 2,000. This pattern implies that readers are more likely to discover and share content at the weekend than during the working week.

For editorial teams, these findings suggest that scheduling major stories or feature articles for weekend publication could improve shareability and overall reach.

Question 4 — What role do visuals and links play?

Business Context:

Visuals and hyperlinks play a key role in how readers engage with online content. For example, images and videos can enhance storytelling and emotional connection, while hyperlinks can improve credibility, depth, and SEO performance.

Approach:

Examine the distributions of num_imgs, num_videos, num_hrefs, and num_self_hrefs.
Compare median shares across binned groups (e.g., 0, 1–2, 3–5, 6–10, 10+).
Use Spearman correlation to measure monotonic (not necessarily linear) relationships between these variables and shares.
Identify whether richer multimedia or higher hyperlink counts consistently align with stronger engagement.
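A minimal sketch of the two measurements above, on invented numbers. Spearman's coefficient is computed here as the Pearson correlation of the ranks, which is its definition. The bin edges are an assumption, and the top bin is labeled "11+" so that it does not overlap with "6-10".

```python
import pandas as pd

# Invented image counts and share counts.
demo = pd.DataFrame({
    "num_imgs": [0, 1, 2, 4, 8, 12],
    "shares":   [900, 1100, 1000, 1500, 1400, 5000],
})

# Spearman's rho: rank both variables, then take the ordinary (Pearson)
# correlation of the ranks. Only the ordering of values matters.
rho = demo["num_imgs"].rank().corr(demo["shares"].rank())

# Right-closed bins: (-1, 0], (0, 2], (2, 5], (5, 10], (10, inf).
edges = [-1, 0, 2, 5, 10, float("inf")]
labels = ["0", "1-2", "3-5", "6-10", "11+"]
demo["img_bin"] = pd.cut(demo["num_imgs"], bins=edges, labels=labels)
median_by_bin = demo.groupby("img_bin", observed=True)["shares"].median()
```

Because only ranks enter the calculation, the single 5,000-share article does not dominate the coefficient the way it would in a Pearson correlation on raw shares.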

Insight Summary:

The analysis shows that richer multimedia and linking are associated with higher shareability. In the Median Shares by # Images chart, articles with 10 or more images show higher median shares, as do articles with roughly 20 to 40 hyperlinks, suggesting that visuals and reference links increase readers’ willingness to share.

Even so, the content quality and contextual relevance of visuals and links likely matter more than their sheer quantity, so editors should favor well-curated, purposeful multimedia over volume.

Question 5 — Do sentiment and keywords influence sharing?

Business Context:

This question examines how an article’s emotional tone and its keywords relate to online engagement.

Articles that strike the right emotional balance, or that align with trending topics through high-performing keywords, are more likely to attract attention and shares.

This analysis explores how sentiment polarity, subjectivity, and keyword strength contribute to article popularity.

Approach:

Evaluate sentiment variables (global_sentiment_polarity, title_sentiment_polarity, global_subjectivity, title_subjectivity) to measure tone and emotional intensity.
Examine keyword-based metrics (kw_min_avg, kw_max_avg, kw_avg_avg) that represent how popular or widely shared those keywords are across the entire dataset.
Compare median shares across binned sentiment and keyword quartiles to identify which emotional tones and topic strengths are most associated with engagement.

              count         mean          std  min          25%          50%          75%            max
kw_min_avg  39644.0  1117.146610  1137.456951 -1.0     0.000000  1023.635611  2056.781032    3613.039819
kw_max_avg  39644.0  5657.211151  6098.871957  0.0  3562.101631  4355.688836  6019.953968  298400.000000
kw_avg_avg  39644.0  3135.858639  1318.150397  0.0  2382.448566  2870.074878  3600.229564   43567.659946
Spearman correlation between kw_min_avg and shares: 0.103
Spearman correlation between kw_max_avg and shares: 0.223
Spearman correlation between kw_avg_avg and shares: 0.256
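The quartile comparison described in the approach can be sketched as below. The eight rows are invented for illustration; only the column name `kw_avg_avg` comes from the dataset.

```python
import pandas as pd

# Invented keyword scores and share counts.
demo = pd.DataFrame({
    "kw_avg_avg": [1800, 2400, 2600, 2900, 3100, 3500, 3900, 5200],
    "shares":     [800, 1000, 1200, 1300, 1500, 1600, 2400, 3000],
})

# qcut splits on the sample quantiles, so each quartile holds about the same
# number of articles even when the variable is skewed.
demo["kw_quartile"] = pd.qcut(
    demo["kw_avg_avg"], q=4, labels=["Q1", "Q2", "Q3", "Q4"]
)
median_by_quartile = demo.groupby("kw_quartile", observed=True)["shares"].median()
```

Equal-count quartiles are a reasonable choice for the keyword metrics because their distributions have long right tails (for example, `kw_max_avg` has a 75th percentile near 6,020 but a maximum of 298,400), which would leave fixed-width bins nearly empty at the top.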

Insight Summary:

The analysis indicates that both emotional tone and keyword popularity are meaningfully associated with shareability.

Articles with positive sentiment polarity, especially those written in optimistic or emotionally expressive language, receive more shares. Similarly, subjective writing, in which authors present opinions or emotional perspectives, also correlates with higher engagement, likely because such content feels more relatable and authentic to readers.

When examining keyword metrics, two variables stood out as the most influential:

kw_avg_avg — representing the average popularity of all keywords in an article, reflecting overall topic appeal.
kw_max_avg — capturing the strongest or trendiest keyword within an article, showing viral potential tied to a single trending topic.

Articles that scored high on both measures received more shares (Spearman correlations with shares of 0.256 for kw_avg_avg and 0.223 for kw_max_avg), suggesting that covering popular or high-traffic topics boosts visibility.

Because these variables showed the strongest and most stable correlations with article shares, I selected them as core predictive features for the modeling phase in Question 6.

In summary, positive tone, expressive style, and keyword relevance are key elements of viral content.

This finding bridges the exploratory analysis to predictive modeling, providing a data-driven rationale for which factors best forecast article success.

Question 6 — Can we predict article popularity before publication?

Business Context:

If editors could estimate an article’s popularity before publishing, they could allocate resources strategically by prioritizing high-potential stories, optimizing headlines, and tailoring content strategy around engagement potential.

To analyze this problem, I framed it as a predictive analytics task, testing whether metadata and pre-publication features (like sentiment, keywords, and structure) can meaningfully forecast share counts.

Approach:

Model 1: Regression — Predict the continuous number of shares using a Random Forest Regressor.
Model 2: Classification — Label each article as viral or non-viral using the median shares threshold, then predict the class with a Random Forest Classifier.
Evaluate model performance using key metrics:
- Regression: R², MAE, RMSE
- Classification: Accuracy, F1-score, and ROC-AUC
Analyze feature importance to identify which article attributes most influence shareability.
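The labeling step for Model 2 can be sketched as follows. The six share counts are invented, and treating "strictly above the median" as viral is an assumption, since the report does not say how ties at the median are handled.

```python
import pandas as pd

# Invented share counts; two articles sit exactly on the median.
shares = pd.Series([500, 900, 1400, 1400, 2300, 8000], name="shares")

threshold = shares.median()

# 1 = viral (strictly above the median), 0 = non-viral.
is_viral = (shares > threshold).astype(int)
```

Ties at the threshold are why the two classes are rarely split exactly in half. In a modeling pipeline, the threshold would typically be computed on the training split only, so that no information from the test articles leaks into the labels.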

	Metric	Value
0	R²	0.019
1	MAE	3,030.389
2	RMSE	10,691.588
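For readers less familiar with the three regression metrics above, the sketch below computes them by hand on a made-up, heavy-tailed set of share counts. The values and the `regression_metrics` helper are illustrative assumptions, not the report's data or code.

```python
import math


def regression_metrics(actual, predicted):
    """Return R², MAE and RMSE for two equal-length lists of numbers."""
    n = len(actual)
    mean_actual = sum(actual) / n
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)                   # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    return {
        "r2": 1 - ss_res / ss_tot,
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
    }


# Four ordinary articles and one viral outlier; the "model" always predicts 2,000.
actual = [1000, 1000, 2000, 2000, 44000]
predicted = [2000, 2000, 2000, 2000, 2000]
metrics = regression_metrics(actual, predicted)
```

With these toy values the MAE is 8,800, the RMSE is roughly 18,800, and R² is about -0.22. RMSE squares each error, so one badly missed viral article inflates it far more than MAE, and a gap between the two (as in the table above) is consistent with a few very large errors.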

	Metric	Value
0	Accuracy	0.642
1	F1 Score	0.634
2	ROC-AUC	0.686

	Predicted: Non-Viral	Predicted: Viral
Actual: Non-Viral	3291	1730
Actual: Viral	1815	3075
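As a cross-check, the accuracy and F1 score reported above can be recomputed from the confusion matrix. The sketch assumes that "Viral" is the positive class. ROC-AUC cannot be recovered this way because it depends on the predicted probabilities rather than on the final class labels.

```python
# Cell counts copied from the confusion matrix above ("Viral" = positive class).
tn, fp = 3291, 1730  # actual non-viral: predicted non-viral / predicted viral
fn, tp = 1815, 3075  # actual viral:     predicted non-viral / predicted viral

total = tn + fp + fn + tp
accuracy = (tp + tn) / total

precision = tp / (tp + fp)  # share of predicted-viral articles that were viral
recall = tp / (tp + fn)     # share of viral articles the model caught
f1 = 2 * precision * recall / (precision + recall)

print(f"n = {total}, accuracy = {accuracy:.3f}, "
      f"precision = {precision:.3f}, recall = {recall:.3f}, F1 = {f1:.3f}")
```

Rounded to three decimals, this gives an accuracy of 0.642 and an F1 of 0.634, matching the metrics table, with precision near 0.640 and recall near 0.629.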

Insight Summary:

The regression model achieved an R² of 0.019, with an MAE of about 3,030 and an RMSE near 10,692. In other words, it explains only about 2% of the variance in share counts, so predicting exact share counts is highly uncertain. Audience engagement remains influenced by external and unpredictable factors (such as news cycles and platform dynamics).

The classification model, by contrast, performed more reliably, reaching ~64% accuracy, F1 = 0.63, and ROC-AUC = 0.69, meaning it distinguishes “viral” from “non-viral” articles better than random chance. The confusion matrix shows that errors are spread fairly evenly across the two classes (1,730 false positives and 1,815 false negatives), so the model is not simply favoring one class.

Feature importance analysis highlights that keyword popularity metrics (kw_avg_avg, kw_max_avg), content length, and sentiment polarity are the most predictive variables. These features together capture both what the article is about and how it communicates — two key drivers of audience engagement.

Overall, while predicting the exact number of shares remains difficult, predictive classification offers useful, if modest, editorial guidance. Editors can use such models to assess potential virality before publication, enabling data-driven decisions in content planning, promotion timing, and SEO.

Key Predictive Features Identified

Rank	Feature	Interpretation
1	`kw_avg_avg`	Average popularity of used keywords; high values indicate topics readers already engage with.
2	`kw_max_avg`	Strongest keyword appeal; articles with trending keywords attract more attention.
3	`n_tokens_content`	Longer, in-depth articles tend to get shared more.
4	`global_sentiment_polarity`	Positive tone enhances reader resonance.
5	`num_hrefs`	Moderate linking boosts credibility and engagement.

Conclusion:

Even basic machine learning models yield useful insight into what drives engagement. By identifying the features associated with popularity, news organizations can refine their content before publishing.

--- title: "Online News Popularity Analysis — Predicting Article Shares Before Publication" author: "Christopher Legarda" date: today jupyter: python3 # Page + theme format: html: theme: cosmo toc: true toc-location: left toc-depth: 3 number-sections: false code-fold: true code-summary: "Show code" code-tools: true df-print: paged smooth-scroll: true anchor-sections: true fig-width: 8 fig-height: 5 fig-align: center tbl-cap-location: top fig-cap-location: bottom # Uncomment if/when you want PDF or Word: # pdf: # documentclass: scrreprt # toc: true # number-sections: true # docx: # toc: true # Execution controls execute: echo: false include: false warning: false message: false cache: true freeze: auto # re-run only when code changes # Nice title banner title-block-banner: true page-layout: full # (Optional) params for quick filtering, etc. # params: # year_min: 2020 # year_max: 2024 --- ```{python} import pandas as pd import numpy as np import matplotlib.pyplot as plt import seaborn as sns pd.set_option("display.max_columns", 61) pd.set_option("display.width", 120) sns.set(style="whitegrid", palette="muted") ``` ```{python} file_path = r"C:\Users\Christopher\Documents\Python Projects\Online_News_Popularity\OnlineNewsPopularity.csv" df = pd.read_csv(file_path) print(f"Shape of the dataset is {df.shape}") df.head(10) ```  ```{python} df.info() ``` ```{python} df.describe().T.head(10) ```  ```{python} df.columns = df.columns.str.strip() df['shares'].describe() plt.figure(figsize=(9,5)) sns.histplot(df["shares"], bins=100, kde=False) plt.title("Distribution of Shares") plt.xlabel("Shares") plt.ylabel("Count") plt.show() # Log-transform for skew plt.figure(figsize=(9,5)) sns.histplot(np.log1p(df["shares"]), bins=100, kde=True) plt.title("Distribution of log(1 + Shares)") plt.xlabel("log(1 + Shares)") plt.ylabel("Count") plt.show() ```  ```{python} plt.figure(figsize=(12,6)) sns.histplot(np.log1p(df["shares"]), bins=100, kde=True) # Better labels plt.title("Distribution 
of Article Shares (log-scaled)") plt.xlabel("Number of Shares") plt.ylabel("Number of Articles") # Replace log ticks with raw share numbers ticks = [4, 6, 8, 10, 12] # log values labels = [f"{int(np.expm1(tick)):,}" for tick in ticks] # back-transform to raw shares plt.xticks(ticks, labels) plt.show() ``` ```{python} plt.figure(figsize=(12,6)) sns.histplot(np.log1p(df["shares"]), bins=100, kde=True) plt.title("Distribution of Article Shares (log-scaled)") plt.xlabel("Number of Shares") plt.ylabel("Number of Articles") # Choose more tick positions (log values) ticks = np.arange(2, 14, 1) # from 4 to 12, step = 1 labels = [f"{int(np.expm1(tick)):,}" for tick in ticks] # back-transform plt.xticks(ticks, labels, rotation=45) # rotate to avoid overlap plt.show() ``` ```{python} import numpy as np def articles_near_share(df, share_value, bins=100): """ Given a share count, return its log1p value and the number of articles in the same histogram bin. """ log_val = np.log1p(share_value) # Create histogram bins on log scale counts, bin_edges = np.histogram(np.log1p(df["shares"]), bins=bins) # Find which bin the log_val falls into bin_idx = np.digitize(log_val, bin_edges) - 1 # Guard against edge cases if bin_idx < 0 or bin_idx >= len(counts): return log_val, 0, (None, None) # Bin range (low, high) for reference bin_range = (bin_edges[bin_idx], bin_edges[bin_idx+1]) return log_val, counts[bin_idx], bin_range ``` ```{python} log_val, count, bin_range = articles_near_share(df, 2000, bins=100) print(f"log1p(2000) = {log_val:.2f}") print(f"Articles in this bin: {count}") print(f"Bin covers log range {bin_range[0]:.2f} – {bin_range[1]:.2f}") ```  ```{python} # Temporarily disable truncation pd.set_option("display.max_rows", None) missing_values = df.isna().sum().sort_values(ascending=False) print(missing_values) # Reset to default (so later outputs don’t spam your screen) pd.reset_option("display.max_rows") ``` ```{python} if "url" in df.columns: print("Duplicate URLs:", 
df["url"].duplicated().sum()) ``` ```{python} df.duplicated().sum() # full row duplicates ``` # 1. Summary This project analyzes the *Online News Popularity* dataset from Kaggle to create a business-focused report. I selected this specific dataset to show my data analysis skills towards more business orientated reporting, oppose to community impact/social impact reporting, where this report focuses on data-driven insights to support business decisions. This dataset has over 39,000 online articles; features such as content structure, keyword performance, sentiment, and the engagement score (measured via social media shares) characterize each article. In this report, I’ll try to convert raw engagement data into actionable business insights by integrating exploratory visualization, feature analysis, and predictive modeling, which could help digital media organizations improve their content planning, publishing, and positioning. # 2. Business Problem & Analytical Questions In digital media, article engagement is the main metric of its success. Understanding why certain articles “go viral” while others do not can help organizations make informed decisions and plan advertising strategies and editorial resource allocation. This project aims to explore the factors that influence online engagement and provide insights into how publishers can increase shareability, visibility, and reader retention. The report seeks to identify data-driven strategies for improving content performance and to evaluate the possibility of predicting an article’s virality before its publication. To achieve this, the analysis is guided by six key business questions: | # | Business Question | Focus Area | |---|--------------------|-------------| | **1.** | What types of content drive the most engagement? | Topic & category performance | | **2.** | Does article length affect popularity? | Content structure & readability | | **3.** | How do timing and recency matter? 
| Publishing schedule optimization | | **4.** | What role do visuals and links play? | Multimedia & SEO strategy | | **5.** | Do sentiment and keywords influence sharing? | Tone, emotion, and keyword strength | | **6.** | Can we predict article popularity before publication? | Predictive modeling for editorial planning | Together, these questions form a **data-to-decision framework** that mirrors the real-world analytics process — from raw data exploration to business insights and predictive forecasting. # 3. Data Overview & Methodology ## 3.1 Data Source & Description The dataset used in this analysis is the **Online News Popularity** dataset, originally published by **Mashable** and made available through the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Online+News+Popularity) and on [Kaggle](https://www.kaggle.com/datasets/thehapyone/uci-online-news-popularity-data-set). In this downloadable dataset, it contains 39,644 rows (online news articles), collected between January 2013 and December 2014, where each labeled with metadata, content features, and social media engagement metrics (measured by the total number of shares on platforms like Facebook, LinkedIn, and Google+). Each row in the dataset represents a single article, and each column captures an attribute describing aspects of its content, publication timing, sentiment, and keyword performance. This dataset provides an opportunity to explore business focus data analysis by identifying engagement patterns to inform editorial strategy, SEO optimization, and content marketing decisions. ## 3.2 Key Feature Categories The table below shows features that are grouped into meaningful categories that reflect editorial and marketing considerations, making the data more interpretable for business stakeholders. 
| Feature Category | Example Columns | Business Interpretation | |------------------|-----------------|--------------------------| | **Content Length** | `n_tokens_title`, `n_tokens_content` | Title and article word counts — indicate readability and information depth. | | **Multimedia Richness** | `num_imgs`, `num_videos` | Number of images and videos — measure visual engagement level. | | **SEO & Linking Strategy** | `num_hrefs`, `num_self_hrefs` | Counts of external and internal hyperlinks — reflect how optimized an article is for search and cross-navigation. | | **Keyword Performance** | `kw_min_avg`, `kw_max_avg`, `kw_avg_avg` | Aggregated keyword popularity metrics — proxy for topic relevance and trend appeal. | | **Sentiment & Tone** | `global_sentiment_polarity`, `global_subjectivity`, `title_sentiment_polarity`, `title_subjectivity` | Emotional tone and objectivity of both the article and its title. | | **Timing Features** | `weekday_is_*`, `is_weekend`, `timedelta` | Publication day and recency — reveal engagement patterns over time. | | **Engagement Outcome** | `shares` | The number of times an article was shared — the main measure of popularity and performance. | ## 3.3 Analytical Approach This report uses a set method to go from finding patterns to providing useful information. Below is a list of analytical techniques that are commonly found in business analytics that I’ve implemented in this report: 1. Descriptive Analytics–Examine historical data to identify how content type, sentiment, visuals, and timing relate to engagement. 2. Diagnostic Analytics–Explore why certain articles perform better using correlation analysis, segmentation, and visual storytelling. 3. 
Predictive Analytics–Build a simple model to evaluate whether pre-publication article features can predict shareability or “virality.” I conducted the analysis using Python on Jupyter Notebook, including libraries like ‘pandas’ and ‘numpy’ for data manipulation, ‘matplotlib’ and ‘seaborn’ for visualization, and ‘scikit-learn’ for statistical modeling and prediction. Since this report targets non-technical and business audiences, it emphasizes the findings and insights rather than the code implementation itself. Before analysis, I’ve performed all data preparation steps, such as handling missing values, cleaning, and transformations, to ensure data accuracy and consistency. # 4. Exploratory Data Analysis (EDA & Insights) This section is the main starting point of the report, which explores through six of the business-driven analytical questions that I’ve mentioned above. Each subsection below presents visuals, descriptive statistics, and summarized insights. ## Question 1 — What types of content drive the most engagement? **Business Context:** Understanding which content categories perform best helps editorial teams prioritize topics that consistently attract higher engagement. This analysis examines which Mashable sections (e.g., Lifestyle, Business, Technology, etc.) receive the most shares. **Approach:** - Group articles by `channel` (e.g., Lifestyle, Business, Tech, etc.) 
- Compare their median and distribution of `shares` - Visualize with bar and box plots to identify engagement patterns ```{python} # 1) Identify the 6 channel columns and coerce to numeric 0/1 channel_cols = [c for c in df.columns if c.startswith("data_channel_is_")] df[channel_cols] = df[channel_cols].fillna(0).astype(int) # 2) Row-wise sum to detect uncategorized or (rare) multi-labeled rows row_sum = df[channel_cols].sum(axis=1) # 3) Assign channel only for rows with exactly one active flag; else mark Uncategorized channel_raw = df[channel_cols].idxmax(axis=1).str.replace("data_channel_is_", "") channel_raw = np.where(row_sum == 0, "uncategorized", channel_raw) # all zeros channel_raw = np.where(row_sum > 1, "uncategorized", channel_raw) # safety for ties # 4) Map to presentation names channel_mapping = { "lifestyle": "Lifestyle", "entertainment": "Entertainment", "bus": "Business", "socmed": "Social Media", "tech": "Technology", "world": "World", "uncategorized": "Uncategorized" } df["channel"] = pd.Series(channel_raw).str.lower().map(channel_mapping) # 5) Sanity checks print("Rows with all zeros:", (row_sum == 0).sum()) print("Rows with >1 channels:", (row_sum > 1).sum()) print(df["channel"].value_counts(dropna=False)) ``` ```{python} df ``` ```{python} channel_summary = ( df.groupby("channel")["shares"] .agg(["count", "median", "mean"]) .sort_values("median", ascending=False) ) channel_summary ``` ```{python} import numpy as np # target on log-scale for robust comparisons df["log_shares"] = np.log1p(df["shares"]) channel_summary = ( df.groupby("channel", observed=True)["shares"] .agg(count="size", median="median", mean="mean") .sort_values("median", ascending=False) ) channel_summary ``` ```{python} # order = channel_summary.index.tolist() # plt.figure(figsize=(10,6)) # ax = sns.barplot(data=df, x="channel", y="shares", estimator=np.median, order=order, ci=95) # # Title and labels # ax.set_title("Typical Article Shares by Content Channel") # 
# ax.set_xlabel("Channel")
# ax.set_ylabel("Median Shares")
#
# # Choose y-ticks (adjust range depending on your data spread)
# max_val = df.groupby("channel")["shares"].median().max()
# yticks = np.linspace(0, max_val, 8)  # 8 evenly spaced ticks
# ax.set_yticks(yticks)
# ax.set_yticklabels([f"{int(t):,}" for t in yticks])
#
# # --- Add data labels ---
# medians = df.groupby("channel")["shares"].median().reindex(order)
# for i, (cat, val) in enumerate(medians.items()):
#     ax.text(i, val + (0.02 * max_val), f"{int(val):,}",  # place slightly above bar
#             ha="center", va="bottom", fontsize=9, fontweight="bold")
#
# plt.tight_layout()
# plt.show()
```

```{python}
#| echo: false
#| include: true
order = channel_summary.index.tolist()

plt.figure(figsize=(10,6))
ax = sns.barplot(data=df, x="channel", y="shares", estimator=np.median, order=order, ci=None)

# Title and labels
ax.set_title("Typical Article Shares by Content Channel")
ax.set_xlabel("Channel")
ax.set_ylabel("Median Shares")

# Choose y-ticks
max_val = df.groupby("channel")["shares"].median().max()
yticks = np.linspace(0, max_val, 8)
ax.set_yticks(yticks)
ax.set_yticklabels([f"{int(t):,}" for t in yticks])

# --- Add data labels ---
medians = df.groupby("channel")["shares"].median().reindex(order)
for i, (cat, val) in enumerate(medians.items()):
    ax.text(i, val + (0.02 * max_val), f"{int(val):,}",
            ha="center", va="bottom", fontsize=9, fontweight="bold")

plt.tight_layout()
plt.show()
```

```{python}
plt.figure(figsize=(12,6))
ax = sns.boxplot(
    data=df, x="channel", y="shares",
    order=order,
    showfliers=False  # hide extreme outliers so boxes are easier to read
)
ax.set_title("Distribution of Article Shares by Content Channel")
ax.set_xlabel("Channel")
ax.set_ylabel("Shares")

# Format y-ticks with commas
yticks = ax.get_yticks()
ax.set_yticklabels([f"{int(t):,}" for t in yticks])

plt.tight_layout()
plt.show()
```

```{python}
def boxplot_stats(x):
    q1 = x.quantile(0.25)
    q3 = x.quantile(0.75)
    iqr = q3 - q1
    # Tukey's fences (1.5× IQR rule)
    lower_whisker = x[x >= (q1 - 1.5 * iqr)].min()
    upper_whisker = x[x <= (q3 + 1.5 * iqr)].max()
    return pd.Series({
        "Q0 (whisker)": lower_whisker,
        "Q1 (25th)": q1,
        "Median (Q2)": x.median(),
        "Q3 (75th)": q3,
        "Q4 (whisker)": upper_whisker,
        "Min": x.min(),
        "Max": x.max()
    })

# Apply per channel
channel_box_stats = df.groupby("channel")["shares"].apply(boxplot_stats).unstack()

# Format with commas
channel_box_stats = channel_box_stats.applymap(lambda v: f"{int(v):,}")
channel_box_stats
```

**Insight Summary:** Social Media articles drive the highest engagement, with a median of ~2,100 shares and strong viral upside. Uncategorized articles follow with a median of about 1,900 shares, and Lifestyle and Technology come close behind with reliable performance. Business and World articles generate lower engagement, although their share counts are more stable. Publishers that want to maximize reach should therefore weight editorial resources toward Social Media, Uncategorized, Lifestyle, and Technology.

## Question 2 — Does article length affect popularity?

**Business Context:** Article length can influence how readers engage with content. Shorter articles may appeal to people who want a quick read, while longer articles can deliver more depth and detail. This analysis tests whether title length and content length (measured by estimated reading time) are related to how often people share an article.

**Approach:**

- Analyze the distribution of `n_tokens_title` and `n_tokens_content` to understand writing patterns.
- Convert content length into estimated reading time, assuming an average reading speed of 200 words per minute.
- Group articles into title length and reading time bins, then calculate median shares within each group to detect engagement trends.
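The reading-time conversion and binning steps listed above can be sketched on a few made-up word counts. This is a minimal illustration, not the report's pipeline: the sample values and the helper name `reading_time_bins` are assumptions, and only the 200 words-per-minute rate comes from the approach. Note that `pd.cut` uses right-closed intervals by default, so a bin with edges 2 and 5 covers reading times above 2 and up to 5 minutes.

```python
import pandas as pd

WORDS_PER_MINUTE = 200  # reading speed assumed in the approach above


def reading_time_bins(word_counts: pd.Series) -> pd.DataFrame:
    """Convert word counts to minutes and assign each to a reading-time bin."""
    minutes = word_counts / WORDS_PER_MINUTE
    bins = pd.cut(
        minutes,
        bins=[0, 2, 5, 10, 20, float("inf")],
        labels=["0-2 min", "2-5 min", "5-10 min", "10-20 min", "20+ min"],
        include_lowest=True,  # keep zero-length articles in the first bin
    )
    return pd.DataFrame({"words": word_counts, "minutes": minutes, "bin": bins})


# Hypothetical word counts, not taken from the dataset
sample = pd.Series([0, 400, 401, 1000, 2500, 5000])
result = reading_time_bins(sample)
print(result)
```

An article of exactly 400 words (2.0 minutes) lands in the first bin, while 401 words moves to the second. Edge behavior like this is worth checking whenever bin labels are written by hand.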
```{python}
print("\n---Statistical description on `n_tokens_title`---\n")
df["n_tokens_title"].describe()
```

```{python}
print("\n---Statistical description on `n_tokens_content`---\n")
df["n_tokens_content"].describe()
```

```{python}
#| echo: false
#| include: true
plt.figure(figsize=(12,6))
sns.histplot(df["n_tokens_title"], bins=30, kde=True)
plt.title("Distribution of Title Length (words)")
plt.xlabel("Title word count")
plt.ylabel("Number of articles")
plt.show()

plt.figure(figsize=(12,6))
sns.histplot(df["n_tokens_content"], bins=50, kde=True)
plt.title("Distribution of Content Length (words)")
plt.xlabel("Content word count")
plt.ylabel("Number of articles")
plt.show()
```

```{python}
plt.figure(figsize=(12,6))
sns.scatterplot(data=df, x="n_tokens_title", y="shares", alpha=0.3)
sns.regplot(data=df, x="n_tokens_title", y="shares", scatter=False, color="red")
plt.title("Shares vs Title Length")
plt.xlabel("Title word count")
plt.ylabel("Number of shares per article")
plt.ylim(0, 20000)  # cap for readability
plt.show()

plt.figure(figsize=(10,6))
sns.scatterplot(data=df, x="n_tokens_content", y="shares", alpha=0.3)
sns.regplot(data=df, x="n_tokens_content", y="shares", scatter=False, color="red")
plt.title("Shares vs Content Length")
plt.xlabel("Content word count")
plt.ylabel("Shares")
plt.ylim(0, 20000)  # cap for readability
plt.show()
```

```{python}
# Title bins (labels match the right-closed bin edges)
df["title_len_bin"] = pd.cut(
    df["n_tokens_title"],
    bins=[0,5,10,15,20,60],
    labels=["0–5","6–10","11–15","16–20","21+"]
)

# Content bins (convert to approx read time at 200 wpm)
df["read_time_min"] = df["n_tokens_content"] / 200
df["content_len_bin"] = pd.cut(
    df["read_time_min"],
    bins=[0,2,5,10,20,60],
    labels=["0–2 min","2–5 min","5–10 min","10–20 min","20+ min"]
)
```

```{python}
title_summary = df.groupby("title_len_bin")["shares"].median()
content_summary = df.groupby("content_len_bin")["shares"].median()

print("Median shares by title length bin:\n", title_summary)
print("\nMedian shares by content length bin:\n", content_summary)
```

```{python}
#| echo: false
#| include: true
plt.figure(figsize=(12,6))
sns.barplot(x=title_summary.index, y=title_summary.values)
plt.title("Median Shares by Title Length")
plt.xlabel("Title word count range")
plt.ylabel("Median Shares")
plt.show()

plt.figure(figsize=(12,6))
sns.barplot(x=content_summary.index, y=content_summary.values)
plt.title("Median Shares by Content Length (Reading Time)")
plt.xlabel("Estimated Reading Time")
plt.ylabel("Median Shares")
plt.show()
```

```{python}
plt.figure(figsize=(12,6))
sns.boxplot(
    data=df, x="title_len_bin", y="shares", hue="channel",
    showfliers=False
)
plt.title("Shares by Title Length (split by Channel)")
plt.ylabel("Shares")
plt.xlabel("Title word count bin")
plt.legend(bbox_to_anchor=(1.05,1), loc="upper left")
plt.show()
```

**Insight Summary:** The title-length histogram is roughly normal and centered on about ten words, which suggests that most editors in the dataset aim for titles of about that length. Content length, by contrast, is strongly right-skewed: most articles run under 2,000 words, and only a small fraction go beyond that. In the two engagement bar charts, titles in the longest bin (more than twenty words) have the highest median shares, although few titles are that long, which suggests that longer titles may attract readers' attention and encourage sharing. Similarly, articles that take more than twenty minutes to read have the highest median share counts. Readers may be more likely to share longer, more detailed articles, possibly because these pieces convey unique insights.
However, since such long articles are relatively rare, these results likely reflect the higher quality and editorial investment in these pieces rather than length alone.

## Question 3 — How do timing and recency matter?

**Business Context:** In news media, timing affects visibility, competition, and user behavior. Identifying when engagement peaks can help optimize content scheduling.

**Approach:**

- Analyze the `weekday_is_*` and `is_weekend` variables.
- Compare median shares by weekday and weekend.
- Evaluate `timedelta` (days since publication) to understand the impact of recency.

```{python}
print("\n---Statistical Summary For `timedelta`---\n")
df["timedelta"].describe()
```

```{python}
from statsmodels.nonparametric.smoothers_lowess import lowess

plt.figure(figsize=(10,5))
sns.scatterplot(data=df, x="timedelta", y="shares", alpha=0.3)

# LOWESS trend line over the scatter
lowess_smoothed = lowess(df["shares"], df["timedelta"], frac=0.05)
plt.plot(lowess_smoothed[:,0], lowess_smoothed[:,1], color="red", linewidth=2)

plt.title("Shares vs Article Recency (timedelta)")
plt.xlabel("Days since publication (timedelta)")
plt.ylabel("Shares")
plt.ylim(0, 20000)
plt.show()
```

```{python}
weekday_cols = [c for c in df.columns if c.startswith("weekday_is_")]
weekday_map = {
    "weekday_is_monday": "Monday",
    "weekday_is_tuesday": "Tuesday",
    "weekday_is_wednesday": "Wednesday",
    "weekday_is_thursday": "Thursday",
    "weekday_is_friday": "Friday",
    "weekday_is_saturday": "Saturday",
    "weekday_is_sunday": "Sunday"
}

# Collapse the one-hot weekday columns into a single label column
weekday_df = df[weekday_cols]
df["weekday"] = weekday_df.idxmax(axis=1).map(weekday_map)
```

```{python}
weekday_summary = (
    df.groupby("weekday")["shares"]
    .median()
    .sort_values(ascending=False)
)
print("Median shares by weekday:\n", weekday_summary)
```

```{python}
#| echo: false
#| include: true
plt.figure(figsize=(12,6))
sns.barplot(data=df, x="weekday", y="shares", estimator=np.median,
            order=weekday_summary.index, ci=None)
plt.title("Median Shares by Weekday")
plt.xlabel("Day of the Week")
plt.ylabel("Median Shares")
plt.tight_layout()
plt.show()
```

```{python}
#| echo: false
#| include: true
# Dataset already has is_weekend column (0 = weekday, 1 = weekend)
weekend_summary = df.groupby("is_weekend")["shares"].median()
weekend_summary.index = ["Weekday","Weekend"]

plt.figure(figsize=(10,5))
sns.barplot(x=weekend_summary.index, y=weekend_summary.values)
plt.title("Median Shares: Weekday vs Weekend")
plt.xlabel("")
plt.ylabel("Median Shares")
plt.show()
```

**Insight Summary:** The analysis shows a clear temporal pattern in audience engagement. Articles published on weekends clearly outperform those released on weekdays: Saturday has the highest median share count (around 2,000), with Sunday close behind. Weekday medians range from about 1,300 to 1,500 shares, compared with about 1,800 to 2,000 on weekends. This pattern implies that readers are more likely to discover and share content at the weekend than during the working week. For editorial teams, these findings suggest that scheduling major stories or feature articles for weekend publication could improve shareability and overall reach.

## Question 4 — What role do visuals and links play?

**Business Context:** Visuals and hyperlinks play a key role in how readers engage with online content. For example, images and videos can enhance storytelling and emotional connection, while hyperlinks can improve credibility, depth, and SEO performance.

**Approach:**

- Examine the distributions of `num_imgs`, `num_videos`, `num_hrefs`, and `num_self_hrefs`.
- Compare median shares across binned groups (e.g., 0, 1–2, 3–5, 6–10, 10+).
- Use Spearman correlation to measure monotonic (not necessarily linear) relationships between these variables and `shares`.
- Identify whether richer multimedia or higher hyperlink counts consistently align with stronger engagement.
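As a small illustration of why a rank-based measure suits this approach, the sketch below compares Pearson and Spearman correlation on synthetic data in which one variable rises monotonically but not linearly with the other. The data are invented for this example and are not drawn from the news dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration): y always increases
# with x, but exponentially rather than in a straight line.
x = pd.Series(np.arange(1, 21), dtype=float)
y = np.exp(x / 2)

# Pearson measures linear association; Spearman works on ranks,
# so any strictly increasing relationship scores a perfect 1.
pearson = x.corr(y, method="pearson")
spearman = x.corr(y, method="spearman")

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Because share counts are heavily skewed, a rank-based measure is less sensitive to a few extreme articles than Pearson correlation would be.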
```{python}
visual_cols = ["num_imgs", "num_videos", "num_hrefs", "num_self_hrefs"]
df[visual_cols].describe().T
```

```{python}
fig, axes = plt.subplots(2, 2, figsize=(12,8))
for ax, col in zip(axes.flat, visual_cols):
    sns.histplot(df[col], bins=50, ax=ax)
    ax.set_title(f"Distribution of {col}")
    ax.set_xlim(0, 50)
plt.tight_layout()
plt.show()
```

```{python}
for col in visual_cols:
    corr = df[col].corr(df["shares"], method="spearman")
    print(f"Spearman correlation between {col} and shares: {corr:.3f}")
```

```{python}
# Define bins (open-ended top bin so very large counts are not dropped)
df["img_bin"] = pd.cut(df["num_imgs"], bins=[-1,0,2,5,10,np.inf],
                       labels=["0","1–2","3–5","6–10","10+"])
df["video_bin"] = pd.cut(df["num_videos"], bins=[-1,0,1,3,10,np.inf],
                         labels=["0","1","2–3","4–10","10+"])
df["href_bin"] = pd.cut(df["num_hrefs"], bins=[-1,5,10,20,40,np.inf],
                        labels=["0–5","6–10","11–20","21–40","40+"])
df["self_href_bin"] = pd.cut(df["num_self_hrefs"], bins=[-1,0,2,5,10,np.inf],
                             labels=["0","1–2","3–5","6–10","10+"])
```

```{python}
#| echo: false
#| include: true
# Plot median shares for each bin
fig, axes = plt.subplots(2, 2, figsize=(12,8))

sns.barplot(data=df, x="img_bin", y="shares", estimator=np.median, ax=axes[0,0], ci=None)
axes[0,0].set_title("Median Shares by # Images")

sns.barplot(data=df, x="video_bin", y="shares", estimator=np.median, ax=axes[0,1], ci=None)
axes[0,1].set_title("Median Shares by # Videos")

sns.barplot(data=df, x="href_bin", y="shares", estimator=np.median, ax=axes[1,0], ci=None)
axes[1,0].set_title("Median Shares by # Hyperlinks")

sns.barplot(data=df, x="self_href_bin", y="shares", estimator=np.median, ax=axes[1,1], ci=None)
axes[1,1].set_title("Median Shares by # Self-References")

for ax in axes.flat:
    ax.set_ylabel("Median Shares")
    ax.set_xlabel("")

plt.tight_layout()
plt.show()
```

**Insight Summary:** The analysis above shows that multimedia and linking are associated with higher shareability.
The bar charts show that articles with more than ten images, or with roughly 21 to 40 hyperlinks, have higher median shares, which suggests that visuals and reference links increase readers' willingness to share. Even so, quantity alone is unlikely to be the driver: the quality and contextual relevance of visuals and links probably matter more, so editors should favor well-curated, purposeful multimedia over volume.

## Question 5 — Do sentiment and keywords influence sharing?

**Business Context:** This question examines how an article's emotional tone and its keywords relate to online engagement. Articles that strike the right emotional balance or align with current trends are often associated with high-performing keywords and are more likely to attract attention and shares. This analysis explores how sentiment *polarity*, *subjectivity*, and keyword strength contribute to article popularity.

**Approach:**

- Evaluate sentiment variables (`global_sentiment_polarity`, `title_sentiment_polarity`, `global_subjectivity`, `title_subjectivity`) to measure tone and emotional intensity.
- Examine keyword-based metrics (`kw_min_avg`, `kw_max_avg`, `kw_avg_avg`) that represent how popular or widely shared those keywords are across the entire dataset.
- Compare median shares across sentiment bins and keyword quartiles to identify which emotional tones and topic strengths are most associated with engagement.
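The choice of quartile bins for the keyword metrics can be illustrated with a short sketch. The values below are synthetic, right-skewed numbers standing in for a keyword score (an assumption for this example, not the dataset). The sketch contrasts `pd.qcut`, which puts an equal number of rows in each bin, with `pd.cut`, which splits the value range into equal widths.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed scores standing in for a keyword metric (assumption)
scores = pd.Series(rng.lognormal(mean=7, sigma=1, size=1000))

labels = ["Q1 (low)", "Q2", "Q3", "Q4 (high)"]
quartile_bin = pd.qcut(scores, 4, labels=labels)  # equal-count bins
width_bin = pd.cut(scores, 4, labels=labels)      # equal-width bins, for contrast

quartile_counts = quartile_bin.value_counts().sort_index()
width_counts = width_bin.value_counts().sort_index()

print("Rows per quartile bin:\n", quartile_counts)
print("Rows per equal-width bin:\n", width_counts)
```

With skewed data, equal-width bins pile most rows into the lowest bin, which makes medians in the upper bins unstable. Quartile bins keep the group sizes comparable.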
```{python}
#| echo: false
#| include: true
kw_cols = ["kw_min_avg", "kw_max_avg", "kw_avg_avg"]

# Summary stats
print(df[kw_cols].describe().T)

# Correlations with shares
for col in kw_cols:
    corr = df[col].corr(df["shares"], method="spearman")
    print(f"Spearman correlation between {col} and shares: {corr:.3f}")

# Histograms
fig, axes = plt.subplots(1, 3, figsize=(15,4))
for ax, col in zip(axes, kw_cols):
    sns.histplot(df[col], bins=50, ax=ax)
    ax.set_title(f"Distribution of {col}")
plt.tight_layout()
plt.show()
```

```{python}
#| echo: false
#| include: true
for col in kw_cols:
    # Quartile bins: equal number of articles per bin
    df[f"{col}_bin"] = pd.qcut(df[col], 4, labels=["Q1 (low)","Q2","Q3","Q4 (high)"])
    plt.figure(figsize=(8,4))
    sns.barplot(data=df, x=f"{col}_bin", y="shares", estimator=np.median, ci=None)
    plt.title(f"Median Shares by {col} Quartile")
    plt.ylabel("Median Shares")
    plt.xlabel(col)
    plt.show()
```

```{python}
#| echo: false
#| include: true
# Polarity: [-1, 1], Subjectivity: [0, 1]
sentiment_cols = [
    "global_sentiment_polarity",
    "global_subjectivity",
    "title_sentiment_polarity",
    "title_subjectivity",
]

for col in sentiment_cols:
    if "subjectivity" in col:
        bins = [0.0, 0.25, 0.50, 0.75, 1.0]
        labels = ["0–0.25", "0.25–0.5", "0.5–0.75", "0.75–1.0"]
        s = df[col].clip(0, 1)  # safety
        xlabel = f"{col} (quartile-like fixed bins)"
    else:
        bins = [-1.0, -0.25, 0.0, 0.25, 1.0]
        labels = ["Very neg (≤-0.25)", "Slight neg (-0.25–0)",
                  "Slight pos (0–0.25)", "Very pos (≥0.25)"]
        s = df[col].clip(-1, 1)  # safety
        xlabel = f"{col} (fixed polarity bins)"

    df[f"{col}_bin"] = pd.cut(s, bins=bins, labels=labels, include_lowest=True)

    plt.figure(figsize=(8,4))
    sns.barplot(data=df, x=f"{col}_bin", y="shares", estimator=np.median, ci=None)
    plt.title(f"Median Shares by {col} bin")
    plt.ylabel("Median Shares")
    plt.xlabel(xlabel)
    plt.tight_layout()
    plt.show()
```

**Insight Summary:** The analysis suggests that both emotional tone and keyword popularity are meaningfully associated with shareability.
Articles with a positive sentiment polarity, especially those written with optimistic or emotionally expressive language, receive more shares. Similarly, subjective writing, where authors present opinions or emotional perspectives, also correlates with higher engagement, likely because such content feels more relatable and authentic to readers.

When examining keyword metrics, two variables stood out as the most influential:

- `kw_avg_avg`: the average popularity of all keywords in an article, reflecting overall topic appeal.
- `kw_max_avg`: the strongest or trendiest keyword within an article, showing viral potential tied to a single trending topic.

Articles that scored high on both measures saw markedly more shares, which indicates that covering popular or high-traffic topics boosts visibility. Since these variables showed the strongest and most stable correlations with article shares, I selected them as core predictive features for the modeling phase in Question 6.

In summary, positive tone, expressive style, and keyword relevance are key elements of viral content. This finding bridges the exploratory analysis to predictive modeling, providing a data-driven rationale for which factors best forecast article success.

## Question 6 — Can we predict article popularity before publication?

**Business Context:** If editors could estimate the popularity of an article before publishing, they could allocate resources strategically by prioritizing high-impact stories, optimizing headlines, and tailoring content strategy around engagement potential. To analyze this problem, I reframed it as a predictive analytics task, testing whether metadata and pre-publication features (like sentiment, keywords, and structure) can meaningfully forecast share counts.

**Approach:**

- Model 1: **Regression** — Predict the continuous number of `shares` using a Linear Regression baseline.
- Model 2: **Classification** — Label each article as *viral* or *non-viral* using the median shares threshold, then predict the class with Logistic Regression. A Random Forest Classifier is fitted separately to rank feature importance.
- Evaluate model performance using key metrics:
  - **Regression:** R², MAE, RMSE
  - **Classification:** Accuracy, F1-score, and ROC-AUC
- Analyze **feature importance** to identify which article attributes most influence shareability.

```{python}
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import (
    r2_score, mean_absolute_error, mean_squared_error,
    accuracy_score, f1_score, roc_auc_score, confusion_matrix
)
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
import matplotlib.pyplot as plt
import seaborn as sns
```

```{python}
features = [
    "n_tokens_title", "n_tokens_content",
    "num_imgs", "num_videos", "num_hrefs", "num_self_hrefs",
    "kw_avg_avg", "kw_max_avg",
    "global_sentiment_polarity", "global_subjectivity",
    "title_sentiment_polarity", "title_subjectivity",
    "channel", "weekday", "is_weekend"
]
target = "shares"
```

```{python}
df_model = df[features + [target]].dropna().copy()
```

```{python}
numeric_features = [
    "n_tokens_title","n_tokens_content","num_imgs","num_videos",
    "num_hrefs","num_self_hrefs","kw_avg_avg","kw_max_avg",
    "global_sentiment_polarity","global_subjectivity",
    "title_sentiment_polarity","title_subjectivity","is_weekend"
]
categorical_features = ["channel","weekday"]

# Scale numeric columns and one-hot encode categorical columns
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features)
])
```

```{python}
X = df_model[features]
y = df_model["shares"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Model 1: linear regression on the raw share count
reg_pipeline = Pipeline([
    ("prep", preprocess),
    ("model", LinearRegression())
])

reg_pipeline.fit(X_train, y_train)
y_pred = reg_pipeline.predict(X_test)
```

```{python}
#| echo: false
#| include: true
metrics = {
    "Metric": ["R²", "MAE", "RMSE"],
    "Value": [r2_score(y_test, y_pred),
              mean_absolute_error(y_test, y_pred),
              np.sqrt(mean_squared_error(y_test, y_pred))]
}
pd.DataFrame(metrics).style.format({"Value": "{:,.3f}"})
```

```{python}
# Label articles above the median share count as viral (1), others as non-viral (0)
median_shares = df_model["shares"].median()
df_model["is_viral"] = (df_model["shares"] > median_shares).astype(int)
```

```{python}
X = df_model[features]
y = df_model["is_viral"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
```

```{python}
# Model 2: logistic regression on the viral / non-viral label
clf_pipeline = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression(max_iter=1000))
])

clf_pipeline.fit(X_train, y_train)
y_pred = clf_pipeline.predict(X_test)
y_prob = clf_pipeline.predict_proba(X_test)[:,1]
```

```{python}
#| echo: false
#| include: true
# --- Compute metrics ---
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_prob)
cm = confusion_matrix(y_test, y_pred)

# --- Create summary table ---
metrics = {
    "Metric": ["Accuracy", "F1 Score", "ROC-AUC"],
    "Value": [accuracy, f1, roc_auc]
}
metrics_df = pd.DataFrame(metrics).style.format({"Value": "{:.3f}"})
display(metrics_df)

# --- Show confusion matrix separately ---
cm_df = pd.DataFrame(cm,
                     index=["Actual: Non-Viral", "Actual: Viral"],
                     columns=["Predicted: Non-Viral", "Predicted: Viral"])
display(cm_df)
```

```{python}
#| echo: false
#| include: true
# Random Forest fitted only to rank feature importance
rf_clf = Pipeline([
    ("prep", preprocess),
    ("model", RandomForestClassifier(n_estimators=200, random_state=42))
])
rf_clf.fit(X_train, y_train)

# Get feature importances (numeric columns first, then one-hot columns,
# matching the ColumnTransformer output order)
feature_names = (
    numeric_features +
    list(rf_clf.named_steps["prep"].named_transformers_["cat"].get_feature_names_out(categorical_features))
)
importances = rf_clf.named_steps["model"].feature_importances_

feat_imp = pd.DataFrame({"feature": feature_names, "importance": importances}).sort_values("importance", ascending=False)

plt.figure(figsize=(8,6))
sns.barplot(y="feature", x="importance", data=feat_imp.head(15))
plt.title("Top Predictive Features for Viral Articles")
plt.tight_layout()
plt.show()
```

**Insight Summary:** The linear regression model achieved an R² of 0.019, with an MAE around 3,030 and an RMSE near 10,692, showing that predicting exact share counts is highly uncertain. Audience engagement remains influenced by external and unpredictable factors (like news cycles and platform dynamics).

The logistic regression classifier performed more reliably, reaching ~64% accuracy, F1 = 0.63, and ROC-AUC = 0.69, meaning it distinguishes “viral” from “non-viral” articles better than random chance. The confusion matrix shows balanced results between the two classes, which indicates that the model does not favor one class over the other.

Feature importances from the Random Forest classifier highlight keyword popularity metrics (`kw_avg_avg`, `kw_max_avg`), content length, and sentiment polarity as the most predictive variables. Together, these features capture both what the article is about and how it communicates, two key drivers of audience engagement.

Overall, while predicting the exact number of shares remains difficult, predictive classification offers valuable editorial guidance. Editors can use such models to assess potential virality before publication, enabling data-driven decisions in content planning, promotion timing, and SEO.

### Key Predictive Features Identified

| Rank | Feature | Interpretation |
|------|----------|----------------|
| 1 | `kw_avg_avg` | Average popularity of used keywords; high values indicate topics readers already engage with. |
| 2 | `kw_max_avg` | Strongest keyword appeal; articles with trending keywords attract more attention. |
| 3 | `n_tokens_content` | Longer, in-depth articles tend to get shared more. |
| 4 | `global_sentiment_polarity` | Positive tone enhances reader resonance. |
| 5 | `num_hrefs` | Moderate external linking boosts credibility and engagement. |

**Conclusion:** Even basic machine learning models reveal which article attributes are associated with popularity. By understanding these drivers, news organizations can improve their content before publishing.
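To put the classifier's comparison with random chance in context, the sketch below builds a no-skill baseline on synthetic data. Everything in it is an assumption for illustration: the feature matrix, the hidden score, and the variable names are invented, and scikit-learn's `DummyClassifier` stands in for a model that ignores the features. Because a median split produces two nearly equal classes, such a baseline sits at about 50% accuracy and a ROC-AUC of 0.5.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in data (assumption): only the first feature carries signal
n = 4000
X = rng.normal(size=(n, 3))
score = 0.8 * X[:, 0] + rng.normal(size=n)
y = (score > np.median(score)).astype(int)  # median split gives balanced classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Baseline that always predicts the most frequent training class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
baseline_auc = roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1])
model_acc = accuracy_score(y_test, model.predict(X_test))
model_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print(f"Baseline accuracy: {baseline_acc:.3f}, ROC-AUC: {baseline_auc:.3f}")
print(f"Model accuracy:    {model_acc:.3f}, ROC-AUC: {model_auc:.3f}")
```

Reporting a baseline next to the model's metrics makes the size of the improvement explicit, which is more informative for editors than the raw accuracy alone.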