Political Bias Detection in News Text:
Linear Regression vs. BERT

NLP · BERT · HDBSCAN · SimCSE · 2024

Overview

This project compares linear regression on TF-IDF / bag-of-words features with BERT-based contextual embeddings for detecting political bias in news articles. HDBSCAN clustering of SimCSE-encoded text shows that political lean has meaningful geometric structure in embedding space, structure that bag-of-words features and linear models fail to capture. The 3D t-SNE visualization shows that partisan separation is real but noisy: left-leaning and right-leaning articles form distinct but overlapping clusters, and hard-to-classify centrist or mixed-bias text occupies the overlap zone.

Model Comparison

Approach	Embedding	Strengths	Limitations
Linear Regression	TF-IDF / BoW	Fast; interpretable coefficients; strong on unigram bias markers	Misses semantic context; for example, the word "liberal" appearing in a conservative article can mislead the model
BERT + HDBSCAN	SimCSE contextual	Captures framing, tone, and semantic context; finds structure the linear model misses	Slower and less interpretable; cluster boundaries are soft
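The bag-of-words limitation in the table can be illustrated with a minimal sketch. The two headlines and the weight values below are hypothetical, chosen only to show that word counts discard the order that carries the framing; they are not taken from the project's data or model.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its word tokens; word order is discarded."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def linear_score(bow: Counter, weights: dict) -> float:
    """Weighted sum of word counts, as a linear model over BoW features would compute."""
    return sum(weights.get(word, 0.0) * count for word, count in bow.items())


# Hypothetical headlines: opposite framing, identical vocabulary.
headline_a = "Liberal senators attack the conservative budget plan"
headline_b = "Conservative senators attack the liberal budget plan"

# Hypothetical unigram weights (negative = left marker, positive = right marker).
weights = {"liberal": -1.0, "conservative": 1.0}

bow_a = bag_of_words(headline_a)
bow_b = bag_of_words(headline_b)
score_a = linear_score(bow_a, weights)
score_b = linear_score(bow_b, weights)

print("identical feature vectors:", bow_a == bow_b)
print("scores:", score_a, score_b)
```

Because both headlines map to the same counts, any linear model over these features must assign them the same score, whatever weights it learns. A contextual encoder sees the token order and can, in principle, tell them apart.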


Figure 1. 3D t-SNE projection of SimCSE embeddings with HDBSCAN cluster assignments. Blue = Cluster −1 (noise/centrist). Red = Cluster 1 (right-leaning). Overlap in the center reflects genuinely ambiguous political framing.

Key Takeaways

BERT: better overall

Contextual embeddings consistently outperform BoW on bias classification; semantic framing matters more than vocabulary.

Noisy cluster boundaries

Partisan separation is real but soft: centrist and mixed-bias articles form a genuine overlap that no model cleanly resolves.

HDBSCAN: no k required

Density-based clustering does not require the number of clusters k to be specified in advance, and it labels noise and outliers as a separate class.
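As a small illustration of how that noise class shows up in practice, the sketch below summarizes a label array. It assumes the convention used by the `hdbscan` and scikit-learn implementations, where noise points are labeled -1; the `summarize_labels` helper and the example labels are hypothetical.

```python
from collections import Counter


def summarize_labels(labels):
    """Count clusters and noise in a label sequence where -1 marks noise."""
    counts = Counter(labels)
    noise = counts.pop(-1, 0)  # remove the noise label so only real clusters remain
    total = len(labels)
    return {
        "n_clusters": len(counts),
        "noise_fraction": noise / total if total else 0.0,
        "cluster_sizes": dict(counts),
    }


# Hypothetical cluster assignments for eight articles.
example_labels = [0, 0, 1, -1, 1, 1, -1, 0]
summary = summarize_labels(example_labels)
print(summary)
```

The number of clusters is read off the output rather than supplied as an input, and the noise fraction gives a quick measure of how much of the corpus the clustering declined to assign.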
