Political Bias Detection in News Text:
Linear Regression vs. BERT

NLP · BERT · HDBSCAN · SimCSE · 2024
GitHub · Report

Overview

A comparative study of linear regression and BERT embeddings for detecting political bias in news articles. Using HDBSCAN clustering on SimCSE-encoded text, we show that political lean has meaningful geometric structure in embedding space, structure that bag-of-words and linear models fail to capture. The 3D t-SNE visualization shows that partisan separation is real but noisy: left-leaning and right-leaning articles form distinct but overlapping clusters, with hard-to-classify centrist or mixed-bias text occupying the overlap zone.


Model Comparison
| Approach | Embedding | Strengths | Limitations |
| --- | --- | --- | --- |
| Linear Regression | TF-IDF / BoW | Fast; interpretable coefficients; strong on unigram bias markers | Misses semantic context; "liberal" in a conservative article confuses the model |
| BERT + HDBSCAN | SimCSE contextual | Captures framing, tone, and semantic context; finds structure the linear model misses | Slower; less interpretable; cluster boundaries are soft |
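A minimal sketch of the linear baseline row above: TF-IDF unigrams feeding a linear regressor onto a bias score. The toy documents and the [-1, 1] score scale are illustrative assumptions, not the study's data.

```python
# Toy TF-IDF + linear-regression bias baseline (illustrative data only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

docs = [
    "tax cuts spur growth and freedom",       # toy right-leaning text
    "universal healthcare is a human right",  # toy left-leaning text
    "the liberal senator praised tax cuts",   # mixed framing
]
scores = [1.0, -1.0, 0.0]  # assumed bias scale: -1 = left, +1 = right

model = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0)).fit(docs, scores)
pred = model.predict(["the liberal policy on tax cuts"])
# A BoW model keys on tokens like "liberal" or "tax cuts" regardless of
# surrounding context -- the limitation noted in the table.
print(pred)
```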
[Image: 3D HDBSCAN cluster visualization of political bias]

Figure 1. 3D t-SNE projection of SimCSE embeddings with HDBSCAN cluster assignments. Blue = Cluster −1 (noise/centrist). Red = Cluster 1 (right-leaning). Overlap in the center reflects genuinely ambiguous political framing.


Key Takeaways
BERT (better overall): Contextual embeddings consistently outperform BoW on bias classification; semantic framing matters more than vocabulary.

Noisy cluster boundaries: Partisan separation is real but soft; centrist and mixed-bias articles form a genuine overlap that no model cleanly resolves.

HDBSCAN (no k required): Density-based clustering avoids the need to pre-specify k, and naturally identifies noise/outliers as a separate class.