Project Overview
This project explores the intersection of statistics and linguistics in political media coverage.
The main objective is to analyze how politicians are described in media narratives and whether personal
attributes such as age, gender, and ethnicity are associated with the adjectives used to portray them.
The project focuses on adjective usage in descriptions of political figures from the Politico 28 Class of 2023
and related media text from the NOW Corpus. The analysis combines linguistic categorization with statistical
testing and machine learning classification.
Research Motivation
Political media narratives influence how the public perceives political leaders. Adjectives such as
“competent”, “emotional”, “defiant”, “energetic”, “far-right”, or “experienced” can frame a politician’s
identity and shape public interpretation of leadership, credibility, ideology, and character.
Main research question: How do politicians’ personal attributes such as age, gender,
and ethnicity influence the adjectives used to describe them in media reports?
Data Sources
The analysis is based on two main data sources. The first source is Politico Europe’s Politico 28 Class of 2023,
which profiles influential figures in European politics, policy, and culture. The second source is the News on the Web
Corpus, a large corpus of online news text used to examine broader media language patterns.
- Politico 28 Class of 2023: used for targeted political profiles and manually extracted descriptive adjectives.
- NOW Corpus: used for broader media context and adjective usage in online news discourse.
- Manual categorization: adjectives were mapped to tags such as competence, emotion, politics, character, age, and appearance.
Dataset
The dataset combines politician-level attributes with adjective-level linguistic information. Each row represents
an adjective used to describe a politician, together with metadata such as gender, age, ethnicity, token frequency,
and semantic tag category.
| Variable | Description | Example |
|---|---|---|
| ADJ | Adjective used in media descriptions | defiant, energetic, experienced |
| Name | Name of the politician | Volodymyr Zelenskyy |
| Gender | Gender of the politician | male / female |
| Age | Age of the politician | 44, 54, 67 |
| Ethnicity | Ethnic or national background | Ukrainian, German, French |
| Tokens | Frequency of the adjective in the dataset | 9 for “defiant” |
| Tag | Semantic category of the adjective | competence, emotion, politics, character |
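The row structure described above can be sketched as a small record type. This is only an illustration of the schema, not the project's actual loading code: the class name `AdjectiveRecord`, the field names, and the placeholder values for name, gender, age, ethnicity, and tag are assumptions; only the adjective “defiant” and its token count of 9 come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdjectiveRecord:
    """One dataset row: an adjective used to describe one politician."""
    adj: str        # ADJ: adjective used in media descriptions
    name: str       # Name of the politician
    gender: str     # "male" or "female"
    age: int        # Age of the politician in years
    ethnicity: str  # Ethnic or national background
    tokens: int     # Frequency of the adjective in the dataset
    tag: str        # Semantic category, e.g. "competence" or "emotion"


# Hypothetical row: everything except adj and tokens is a placeholder.
example = AdjectiveRecord(
    adj="defiant",
    name="Politician A",
    gender="male",
    age=54,
    ethnicity="German",
    tokens=9,
    tag="character",
)
```

Keeping one adjective per row means politician-level attributes such as age and gender repeat across rows, which is what allows adjective tags to be compared across demographic groups.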
Problem Definition
The project investigates whether patterns in adjective usage are associated with demographic and political attributes.
The analysis asks whether younger and older politicians receive different descriptions, whether gender is associated
with competence- or emotion-related adjectives, and whether ethnicity is associated with descriptive language patterns.
A secondary modelling task uses age and adjective tags to predict gender, not because gender prediction itself is the
final goal, but because the model can reveal which linguistic and demographic variables carry predictive information.
Statistical Methods
The project combines exploratory data analysis with statistical testing and interpretable machine learning. The goal is
not only prediction, but also interpretation of linguistic and demographic patterns in media descriptions.
| Method | Purpose | Interpretation |
|---|---|---|
| Exploratory Data Analysis | Inspect distributions of gender, ethnicity, age, and adjective tags | Understand dataset structure and possible imbalance |
| Welch Two Sample t-Test | Compare group means under unequal variance assumptions | Used to test differences in adjective frequencies |
| Density Plots | Visualize distributional differences | Useful for comparing descriptor frequency across groups |
| Decision Tree | Predict gender using age and adjective tags | Interpretable model for linguistic-demographic patterns |
| Confusion Matrix | Evaluate classification performance | Shows correct and incorrect gender predictions |
| MSE / RMSE / R² | Measure model prediction error and explanatory power | Summarize model quality and limitations |
Exploratory Analysis
The dataset contains a diverse set of politicians with different ages, genders, and ethnic backgrounds. The age range
spans roughly from the mid-40s to the early 70s, and the ethnicity distribution includes Ukrainian, German, French,
American, Turkish, Estonian, and other backgrounds.
- French was the most frequent ethnicity in the analyzed politician dataset.
- The gender distribution showed a male majority, but with substantial female representation.
- Tags such as politics, character, age, emotion, competence, and status were used to classify adjectives.
- Gender-based tag comparisons suggested different descriptor patterns for male and female politicians.
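A gender-based tag comparison of the kind listed above can be sketched with the standard library alone. The function name `tag_shares_by_gender` and the toy rows are made up for illustration and carry no empirical claim; the real analysis would run over the full dataset.

```python
from collections import Counter, defaultdict


def tag_shares_by_gender(rows):
    """Return, per gender, the share of adjectives falling under each tag."""
    counts = defaultdict(Counter)
    for gender, tag in rows:
        counts[gender][tag] += 1

    shares = {}
    for gender, tag_counts in counts.items():
        total = sum(tag_counts.values())
        shares[gender] = {tag: n / total for tag, n in tag_counts.items()}
    return shares


# Toy (gender, tag) pairs, invented for this sketch only.
toy_rows = [
    ("male", "competence"), ("male", "competence"),
    ("male", "politics"), ("male", "character"),
    ("female", "politics"), ("female", "character"),
    ("female", "competence"), ("female", "emotion"),
]

shares = tag_shares_by_gender(toy_rows)
```

Comparing shares rather than raw counts matters here because the gender distribution is unbalanced, so raw counts would favor the majority group.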
Statistical Testing
Welch t-tests were used to compare adjective-frequency patterns across groups. One analysis suggested that
descriptors tagged as “competence” appeared more frequently for male politicians than for female politicians in the dataset.
- t = 3.70: Welch t-test statistic for competence descriptor frequency across gender.
- p = 0.00057: statistically significant difference in competence descriptor frequency.
- p = 0.680: no significant age-group difference for emotion descriptor usage.
These findings suggest potential gender-related differences in how competence is linguistically attributed, while
age did not clearly explain the use of emotion-related descriptors in this dataset.
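For reference, the Welch statistic and its Welch-Satterthwaite degrees of freedom can be computed by hand as in the sketch below. The function name `welch_t` and the two toy samples are assumptions for illustration and do not reproduce the reported values. Obtaining a p-value additionally requires the t-distribution, for example via `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    # Per-group squared standard errors (sample variance / group size).
    se_a = variance(a) / len(a)
    se_b = variance(b) / len(b)

    t = (mean(a) - mean(b)) / sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1)
    )
    return t, df


# Toy frequency samples for two groups (made up, not the project's data).
group_a = [4.0, 6.0, 5.0, 7.0, 8.0]
group_b = [2.0, 3.0, 4.0, 3.0, 3.0]

t_stat, dof = welch_t(group_a, group_b)
```

Unlike the pooled-variance t-test, this version does not assume the two groups share a variance, which is the safer choice when group sizes and spreads differ.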
Decision Tree Model
A decision tree model was used to predict gender based on age and adjective tag variables. The goal was to produce
an interpretable model that reveals which features contribute most to gender-related linguistic patterns.
The decision tree initially split politicians by age group, indicating that age carried predictive information in the
model. Subsequent splits used linguistic tag features such as character, politics, and competence. This helped identify
how demographic variables and adjective categories interacted in the dataset.
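The first split on age can be illustrated with a minimal threshold search. This sketch assumes a CART-style criterion (weighted Gini impurity), which the write-up does not confirm; the function names and the toy ages and labels are hypothetical and carry no empirical claim.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_age_split(ages, labels):
    """Return (threshold, weighted impurity) of the best 'age <= threshold' split.

    Returns (None, impurity of the unsplit data) if no split improves on it.
    """
    best = (None, gini(labels))
    values = sorted(set(ages))
    # Candidate thresholds are midpoints between consecutive distinct ages.
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [y for a, y in zip(ages, labels) if a <= threshold]
        right = [y for a, y in zip(ages, labels) if a > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Toy data, invented so that one threshold separates the classes perfectly.
toy_ages = [44, 47, 52, 58, 63, 70]
toy_labels = ["female", "female", "female", "male", "male", "male"]

split = best_age_split(toy_ages, toy_labels)
```

A full tree repeats this search over every feature, including the adjective tag indicators, and recurses into each side of the chosen split.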
Feature Interpretation
- Age: the most important feature in the final model.
- Character tags: important for separating some gender groups.
- Politics tags: captured ideological and role-related descriptions.
- Competence tags: reflected ability, leadership, and performance-related descriptions.
Model Evaluation
The final decision tree model was evaluated using a confusion matrix and common classification metrics. The confusion
matrix shows that the model classified many cases correctly but still produced false positives and false negatives,
most likely because of the small dataset and overlapping linguistic patterns across groups.
- Precision = 0.85: high precision for the positive class in the final reported model.
- Recall = 0.80: the model captured a large proportion of positive-class examples.
- R² = 0.2849: moderate explanatory power with room for improvement.
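Precision and recall follow directly from the confusion matrix counts, as the sketch below shows. The function names and the toy labels are assumptions for illustration. The write-up does not state which gender was treated as the positive class, so the label `1` here is a placeholder.

```python
def confusion_counts(y_true, y_pred, positive):
    """Return (tp, fp, fn, tn) for a binary task with the given positive label."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(tp, fp, fn):
    """Precision = tp / (tp + fp); recall = tp / (tp + fn)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels (made up): 1 marks the positive class, 0 the negative class.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(toy_true, toy_pred, positive=1)
precision, recall = precision_recall(tp, fp, fn)
```

Precision penalizes false positives and recall penalizes false negatives, so reporting both gives a fuller picture than accuracy alone on an unbalanced gender distribution.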
Additional Metrics
- MSE: 0.1643.
- RMSE: 0.4054.
- Feature importance: age contributed the largest share in the reported model.
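The error metrics above are related by simple formulas, sketched here under the assumption that gender labels were encoded as 0/1 and compared against numeric model outputs. The write-up does not specify the encoding, and the function name and toy values are hypothetical.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return (MSE, RMSE, R²) for numeric targets and predictions."""
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)

    mse = ss_res / n
    rmse = sqrt(mse)
    r2 = 1.0 - ss_res / ss_tot  # undefined if all targets are identical
    return mse, rmse, r2


# Toy 0/1 labels and predicted scores (made up for illustration).
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.2, 0.1, 0.3]

mse, rmse, r2 = regression_metrics(toy_true, toy_scores)

# RMSE is the square root of MSE, so the two reported values can be cross-checked.
reported_rmse_check = sqrt(0.1643)
```

The square root of the reported MSE (0.1643) is about 0.4053, which agrees with the reported RMSE of 0.4054 up to rounding of the MSE.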
Key Findings
The analysis suggests that media descriptions of politicians are not random. Some adjective categories appear to be
associated with demographic and political attributes, especially gender and age. However, due to the limited dataset size,
the results should be interpreted as exploratory rather than definitive.
- Competence descriptors were more frequent for male politicians in the analyzed dataset.
- Emotion descriptors did not show a statistically significant difference between age groups.
- Decision tree results suggested that age and adjective tag categories contain useful predictive information.
- The model achieved good precision and recall, but explanatory power remained moderate.
- The project highlights how language can reflect or reinforce media framing and political stereotypes.
Limitations
The main limitation is dataset size. Politico 28 is a small and curated sample, so conclusions cannot be generalized
to all political media coverage. Some adjective categories were manually assigned, which can introduce subjectivity.
Additionally, token frequency may be influenced by media attention rather than only by personal characteristics.
- Small sample size based on a curated group of political figures.
- Manual adjective tagging may introduce classification bias.
- NOW Corpus context may vary by source, country, and publication style.
- Decision tree models are interpretable but sensitive to small data changes.
- Results should be treated as exploratory evidence, not causal proof of media bias.
Future Work
Future work could expand the dataset, include more political figures, add more news sources, and improve linguistic
feature extraction using NLP methods. More advanced models could be compared with the decision tree while preserving
interpretability.
- Use larger corpora and more political profiles.
- Automate adjective extraction with NLP pipelines.
- Apply sentiment analysis and contextual embeddings.
- Compare decision trees with random forests, logistic regression, and transformer-based models.
- Validate findings across countries, media outlets, and time periods.
Outcome
This project strengthened my ability to combine statistical analysis with text-based data and linguistic interpretation.
It demonstrates practical experience with data cleaning, feature categorization, exploratory visualization, hypothesis
testing, decision tree modelling, and model evaluation in a social-science context.
It is a useful portfolio project because it connects statistics, machine learning, NLP-style feature engineering, and
political media analysis in one interpretable workflow.
Statistics Meets Linguistics
Political Media
NLP
Adjective Analysis
Decision Tree
Welch t-Test
Confusion Matrix
Python
Jupyter Notebook