✅ Week 11 - Lab Solutions
Predicting true statements using the LIAR data set
This solution file follows the format of the Jupyter Notebook (.ipynb) file you had to fill in during the lab session.
👉 NOTE: We wanted to flag that this solution is far from optimal. Rather, it serves as a demonstration of the kinds of analyses that can be performed on this data set with the tools studied in DS202 (supervised learning part) or tools that extend them slightly (unsupervised part).
⚙️ Setup
Loading libraries
import numpy as np
import pandas as pd
import spacy
from scipy.stats import mode
from sklearn.preprocessing import OneHotEncoder
from gensim.models.ldamodel import LdaModel
from gensim.models.coherencemodel import CoherenceModel
from sklearn.decomposition import LatentDirichletAllocation
import matplotlib.pyplot as plt
from gensim.models import Phrases
from gensim.models.phrases import Phraser
from gensim.corpora import Dictionary
from collections import Counter
from itertools import chain
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import confusion_matrix, f1_score, recall_score, precision_score
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.decomposition import TruncatedSVD
import umap
from sklearn.cluster import SpectralClustering,AgglomerativeClustering
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM
from kmedoids import KMedoids
import plotly.express as px
from gower import gower_matrix
from tqdm import tqdm
import pyLDAvis
import networkx as nx
from scipy.sparse import csr_matrix
from skfuzzy.cluster import cmeans
from lets_plot import *
LetsPlot.setup_html()

import warnings
warnings.filterwarnings('ignore')

# Load spaCy stopwords
nlp = spacy.load("en_core_web_sm")
stopwords = nlp.Defaults.stop_words

%matplotlib inline
import os
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
Load the data set
We will start by loading liar and inspecting the distribution of the target. We find that approximately 42% of samples contain true statements, as defined in the Week 11 Lab Roadmap.
# Read the .csv
liar = pd.read_csv("../data/liar-dataset-cleaned-v2.csv")

# How is our target distributed?
liar.value_counts("true_statement").to_frame()
| true_statement | count |
---|---|---|
0 | False | 3304 |
1 | True | 2436 |
Feature Selection: Statements
We’ll start by extracting features from the statements. The code for doing this is largely similar to the code provided in Week 10, but with two differences:
- We are creating n-grams, specifically unigrams and bigrams, as features.
- We have specified a minimum document frequency of 15 to cut down on the number of features.
# Preprocess function: tokenize, remove punctuation, numbers, and stopwords
def preprocess_text(text):
    doc = nlp(text.lower())  # Lowercase text
    tokens = [
        token.lemma_
        for token in doc
        if not token.is_punct and not token.is_digit and not token.is_space
        and token.text.lower() not in stopwords
    ]
    return " ".join(tokens)  # Return cleaned text as a string

# Apply preprocessing to each statement
liar["statement_cleaned"] = liar["statement"].apply(preprocess_text)

# Create document-feature matrix (DFM) using CountVectorizer
vectorizer = CountVectorizer(min_df=15, ngram_range=(1, 2))
dfm = vectorizer.fit_transform(liar["statement_cleaned"])

# Convert to a DataFrame for inspection
dfm = pd.DataFrame(dfm.toarray(), columns=vectorizer.get_feature_names_out())
Feature Selection: Contexts
Statements are made in many contexts, which might influence the statement maker's propensity to tell the truth. We will build code that rationalises context by lumping all contexts that appear in 10 or fewer statements into an "other" category. To work with scikit-learn, we can transform this column into a one-hot encoded series of dummy variables.
# Create a count for each context
counts = liar.value_counts("context")

# Set the threshold
threshold = 10

# Create a variable that transforms the context column
liar["context_lumped"] = liar["context"].apply(lambda x: x if pd.notna(x) and counts.get(x, 0) >= threshold else "other")

# Convert the column to dummies
contexts = pd.get_dummies(liar["context_lumped"], prefix="context")
Feature Selection: Subjects
As some subjects are more contentious than others, this might also influence a public figure's propensity to tell the truth. subjects has already been transformed into dummies, so all we need to do is isolate them.

subjects = liar[liar.columns[liar.columns.str.contains("subj_")]]
Constructing the training and testing set
# Concatenate the features into one set of features
X = pd.concat([dfm, subjects, contexts], axis=1)

# Isolate the target
y = liar["true_statement"]

# Perform the train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=123)
Supervised Learning: Penalised logistic regression
Let’s try out some different combinations of hyperparameters. We will build a dictionary that has different kinds of penalties (l1 = Lasso, l2 = Ridge), along with different levels of penalty.

# Create a dictionary of hyperparameter choices
pl_params = {"penalty": ["l1", "l2"], "C": [0.5, 0.1, 0.01, 0.001]}
# Instantiate a logistic regression classifier
logit_classifier = LogisticRegression(solver="liblinear", max_iter=1000, random_state=123)

# Instantiate a Grid Search
pl_grid = GridSearchCV(logit_classifier, param_grid=pl_params, scoring="f1", cv=10)

_ = pl_grid.fit(X_train, y_train)
Below, we can visualise which hyperparameter combination works best. We can see that at high levels of regularisation, Ridge tends to outperform Lasso considerably. However, when regularisation is relaxed, Lasso starts to slightly outperform Ridge.
("param_C", "mean_test_score", color = "param_penalty")) +
ggplot(pd.DataFrame(pl_grid.cv_results_), aes(+
geom_point() = "dashed") +
geom_line(linetype =element_blank()) +
theme(panel_grid_major_x= "Regularisation level", y = "Mean F1-Score", color = "Penalty Type",
labs(x = "A Lasso model with middling regularisation scores highest!")
title )
We can now apply our “optimal” hyperparameter combinations to the whole of the training set.
# Select the best estimator from the grid search
lasso_classifier = pl_grid.best_estimator_

# Fit the model to the training data
_ = lasso_classifier.fit(X_train, y_train)
To peek "under the hood" of the Lasso model, we can explore which factors tend to predict truthful / untruthful statements. Unsurprisingly, subjects dealing with Facebook Fact-Checks and Fake News are the most indicative of untruthful statements. We see that topics such as education and the economy, however, are most indicative of truthful statements.
# Create a data frame of the top 20 features
top_20_feats = (
    pd.DataFrame({"features": lasso_classifier.feature_names_in_,
                  "coefs": lasso_classifier.coef_.reshape(-1)})
    .assign(abs_coefs = lambda x: np.abs(x["coefs"]),
            sign = lambda x: np.where(x["coefs"] > 0, "Positive", "Negative"))
    .sort_values("abs_coefs")
    .tail(20)
)
# Plot the output
(
    ggplot(top_20_feats, aes("abs_coefs", "features", fill = "sign")) +
    geom_bar(stat = "identity", tooltips=layer_tooltips().line("Abs. coef. value: @abs_coefs").line("@sign")) +
    theme(panel_grid_major_y = element_blank()) +
    labs(x = "Absolute coefficient value", y = "Features", fill = "Influence")
)
We can now apply our insights to the test set, to see how well our “optimal” model performs. We have an F1-score of 0.66, which is better than flipping a coin. Let’s see if we can get better performance using an XGBoost.
# Apply class predictions to the test set
predictions = lasso_classifier.predict(X_test)

# f1-score
np.round(f1_score(y_test, predictions), 2)
0.66
Supervised Learning: XGBoost
With XGBoost, we will vary the proportion of features sampled when building each tree in the model and the learning rate. To speed things up, we are going to take advantage of parallel processing by setting n_jobs = -1, which lets the algorithm run across all available cores.
# Create a dictionary of hyperparameter values to try
xgb_params = {"n_estimators": [1000], "colsample_bytree": [0.3, 0.6, 0.9], "learning_rate": [0.001, 0.01, 0.1]}

# Instantiate an XGBoost classifier, utilising all cores in your laptop
xgb_classifier = XGBClassifier(n_jobs = -1)

# Create a 10-fold cross-validation algorithm
xgb_grid = GridSearchCV(xgb_classifier, param_grid=xgb_params, scoring="f1", cv=10)

# Fit the algorithm to the training data
_ = xgb_grid.fit(X_train, y_train)
We see that an XGBoost that uses 90% of features to build each decision tree and a learning rate of 0.01 produces the best results out of all the hyperparameter combinations we have tried.
"param_colsample_bytree"] = xgb_grid.cv_results_["param_colsample_bytree"].astype(str)
xgb_grid.cv_results_[
("param_learning_rate", "mean_test_score", color = "param_colsample_bytree")) +
ggplot(pd.DataFrame(xgb_grid.cv_results_), aes(=layer_tooltips().line("@mean_test_score")) +
geom_point(tooltips= "dashed") +
geom_line(linetype ="bottom",
theme(legend_position= element_blank()) +
panel_grid_major_x= "Learning rate", y = "Mean F-1 Score", color = "Proportion of features\nused in each tree",
labs(x = "An XGBoost that uses 90% of features and a learning rate of 0.01\nproduces the highest score!")
title )
After evaluating the model on the test set, we find that our F1-score improves by ~2 percentage points - noticeable, but there is still obviously room for improvement.
# Pick the best XGBoost
xgb_classifier = xgb_grid.best_estimator_

# Fit the best model to the training data
_ = xgb_classifier.fit(X_train, y_train)

# Apply class predictions to the test set
predictions = xgb_classifier.predict(X_test)

# f1-score
np.round(f1_score(y_test, predictions), 2)
0.68
Unsupervised learning approaches
In this part, we’ll extract insights from the LIAR dataset using a couple of different unsupervised techniques:
- clustering
- anomaly detection
- topic modeling
But before we do that, we’ll pre-process the data again, as the pre-processing here differs a bit from the pre-processing used in the supervised learning part (note that you could also have used the pre-processing performed here in the supervised learning part). The main reason for this change is to enhance the results of topic modeling (LDA performed on data pre-processed as in the supervised learning part doesn’t yield particularly meaningful results for this dataset!).
Pre-processing
= liar["statement"].astype(str).tolist()
texts def spacy_preprocess(texts):
= []
processed
for doc in nlp.pipe(texts, batch_size=500):
= []
tokens
for token in doc:
if (
or
token.is_punct or
token.like_num or
token.is_space not token.is_alpha
):continue
# Keep all named entities intact (e.g., "Affordable Care Act")
if token.ent_type_ in {"PERSON", "ORG", "GPE", "LAW", "NORP", "EVENT"}:
tokens.append(token.text.lower())continue
# Keep only relevant parts of speech: NOUN, VERB, ADJ, ADV
if token.pos_ in {"NOUN", "VERB", "ADJ", "ADV"}:
= token.lemma_.lower()
lemma
tokens.append(lemma)
processed.append(tokens)
return processed
# Run preprocessor
= spacy_preprocess(texts)
tokenized_texts
# Train bigram and trigram models
= Phrases(tokenized_texts, min_count=3, threshold=5)
bigram_model = Phrases(bigram_model[tokenized_texts], threshold=5)
trigram_model
= Phraser(bigram_model)
bigram_phraser = Phraser(trigram_model)
trigram_phraser
# Apply phrase detection
= [bigram_phraser[doc] for doc in tokenized_texts]
texts_bigrams = [trigram_phraser[bigram_phraser[doc]] for doc in tokenized_texts]
texts_trigrams
# Final cleaned version
= [[w for w in doc if 2 < len(w) < 25] for doc in texts_trigrams] final_texts
The goal of the pre-processing is to clean and structure political statements for downstream analysis — without stripping away politically meaningful expressions like “tax cut”, “Donald Trump”, “Affordable Care Act”. Here’s what’s happening:
1️⃣ Step 1: Load LIAR data
= liar["statement"].astype(str).tolist() texts
The dataset is loaded and converted to a list of strings (statement column).
2️⃣ Step 2 (within the spacy_preprocess function): spaCy Preprocessing Function

def spacy_preprocess(texts):
    processed = []

    for doc in nlp.pipe(texts, batch_size=500):
        tokens = []
        for token in doc:
- Batch processes texts using spaCy for speed.
- Initializes a list of tokens for each doc (text).
3️⃣ Step 3 (within the spacy_preprocess function): Remove non-informative tokens

if (
    token.is_punct or
    token.like_num or
    token.is_space or
    not token.is_alpha
):
    continue
Remove (i.e. skip) punctuation, digits, whitespace, and anything that isn’t a proper word (is_alpha).
4️⃣ Step 4 (within the spacy_preprocess function): Preserve Named Entities (Important!)

if token.ent_type_ in {"PERSON", "ORG", "GPE", "LAW", "NORP", "EVENT"}:
    tokens.append(token.text.lower())
    continue
- If a token is part of a named entity (e.g., a law, person, organization, political event), we keep it as-is, lowercased.
- This ensures phrases like “Affordable Care Act” are preserved in the next step.
5️⃣ Step 5 (within the spacy_preprocess function): Filter POS tags (keep only meaningful words)

if token.pos_ in {"NOUN", "VERB", "ADJ", "ADV"}:
    lemma = token.lemma_.lower()
    tokens.append(lemma)
- Keeps only nouns, verbs, adjectives, and adverbs.
- Lemmatizes them (e.g., “running” → “run”, “better” → “good”).
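To make the lemmatization step concrete, here is a quick check on a made-up phrase; the exact lemmas and tags shown in the comment are illustrative and depend on the spaCy model version:

# Inspect spaCy lemmas and POS tags for a toy phrase (output is illustrative)
doc = nlp("Politicians are running campaigns")
print([(t.text, t.lemma_.lower(), t.pos_) for t in doc])
# e.g. [('Politicians', 'politician', 'NOUN'), ('are', 'be', 'AUX'),
#       ('running', 'run', 'VERB'), ('campaigns', 'campaign', 'NOUN')]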
6️⃣ Step 6: Generate Bigrams and Trigrams
bigram_model = Phrases(tokenized_texts, min_count=3, threshold=5)
trigram_model = Phrases(bigram_model[tokenized_texts], threshold=5)

bigram_phraser = Phraser(bigram_model)
trigram_phraser = Phraser(trigram_model)

texts_bigrams = [bigram_phraser[doc] for doc in tokenized_texts]
texts_trigrams = [trigram_phraser[bigram_phraser[doc]] for doc in tokenized_texts]
- Learns common bigrams like ["tax", "cut"] → "tax_cut" and trigrams like ["affordable", "care", "act"] → "affordable_care_act".
- threshold=5 ensures only semi-frequent phrases are merged.
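As a hypothetical illustration of what the phraser does, assuming "tax cut" occurred often enough in the corpus to pass min_count and threshold:

# Feed a tokenised document through the trained bigram phraser
print(bigram_phraser[["tax", "cut", "help", "family"]])
# e.g. ['tax_cut', 'help', 'family'] if "tax cut" was learned as a phrase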
7️⃣ Step 7: Final Cleanup
final_texts = [[w for w in doc if 2 < len(w) < 25] for doc in texts_trigrams]
Filters out tokens that are too short (like “a”, “it”) or too long (usually garbage).
Why is this pre-processing suitable for political text such as the LIAR dataset?
- We’re only removing punctuation, numbers, spaces, and non-alphabetic characters, while retaining important content.
- In particular, we keep named entities in their original form (just lowercased) rather than lemmatizing them, which maintains their recognizable identity. And we’ve included named entity types that are highly relevant in political discourse.
- We made a sensible part-of-speech selection. By including nouns, verbs, adjectives, and adverbs, we capture the key content words that convey meaning and sentiment in political statements.
- The addition of bigram and trigram detection is particularly valuable for political text, as it will capture important phrases like “health_care_reform” or “tax_policy” rather than treating the individual words separately.
- The final step of removing very short (≤2 character) and very long (≥25 character) tokens helps eliminate potential noise while keeping meaningful content.
This approach should preserve the important semantic content needed for analyzing political statements, including entities, topics, and sentiment-bearing words, while still providing useful normalization and noise reduction.
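As a quick sanity check, we can run a single invented statement through the pipeline; the output shown in the comment is illustrative rather than an actual run, since the exact tokens depend on the NER and POS tags assigned by the model:

sample = ["The Affordable Care Act cut taxes for 10 million middle-class families."]
print(spacy_preprocess(sample)[0])
# e.g. something like ['affordable', 'care', 'act', 'cut', 'tax', 'middle', 'class', 'family']
# "10" and "million" are dropped (like_num), punctuation is dropped,
# and tokens tagged as part of the LAW entity survive intact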
We’ll now construct a TF-IDF DFM.
= [" ".join(tokens) for tokens in final_texts]
docs
# TF-IDF: preserve important phrases by analyzing words and bigrams
= TfidfVectorizer(
vectorizer =0.5,
max_df=5,
min_df=(1, 3), # can also try (2, 3) for stricter phrase focus
ngram_range
)= vectorizer.fit_transform(docs) X_tfidf
LSA for dimensionality reduction
We’ll use LSA to reduce dimensions before applying clustering or anomaly detection.
Let’s first start by choosing the number of components for LSA.
There are several methods to determine the number of components for LSA (some of which can also be used for LDA).
1. Explained Variance (LSA-specific)
Method | Explained Variance (via TruncatedSVD) |
---|---|
Use Case | Latent Semantic Analysis (LSA) |
Metric | Cumulative variance explained by components |
How It Works | Selects the smallest number of components needed to explain a threshold (e.g., 90%) of variance in the TF-IDF matrix |
Pros | Intuitive, quick to compute, gives a rough dimensionality estimate |
Cons | May not reflect semantic coherence of topics, not meaningful for LDA or NMF |
Outcome in Our Case | No clear elbow or plateau was observed. Explained variance increased gradually without a strong inflection point. (see plot below) |
def plot_explained_variance(X_tfidf, max_components=100, thresholds=[0.4, 0.5, 0.6, 0.7, 0.8]):
    """
    Plot cumulative explained variance from TruncatedSVD for LSA, and mark thresholds if they are reached.
    Returns a dictionary mapping thresholds to number of components needed (if reached).
    """
    n_components = min(max_components, X_tfidf.shape[1] - 1)
    svd = TruncatedSVD(n_components=n_components, random_state=42)
    svd.fit(X_tfidf)

    explained_variance = svd.explained_variance_ratio_
    cumulative_variance = np.cumsum(explained_variance)

    max_variance = cumulative_variance[-1]
    print(f"🔍 Max cumulative explained variance: {max_variance:.3f}")

    plt.figure(figsize=(10, 6))
    plt.plot(range(1, n_components + 1), cumulative_variance, color='#40BCD8', linestyle='-', label='Cumulative Explained Variance')
    plt.grid(True)
    plt.xlabel('Number of Components')
    plt.ylabel('Cumulative Explained Variance')
    plt.title('Explained Variance by Number of LSA Components')

    threshold_components = {}
    colors = ['#1C77C3', '#39A9DB', '#F39237', '#D63230', '#D63230']

    for i, threshold in enumerate(thresholds):
        if threshold <= max_variance:
            n_required = np.argmax(cumulative_variance >= threshold) + 1
            threshold_components[threshold] = n_required

            plt.axhline(y=threshold, color=colors[i % len(colors)], linestyle='--',
                        label=f'{int(threshold * 100)}% Explained Variance')
            plt.axvline(x=n_required, color=colors[i % len(colors)], linestyle=':',
                        label=f'{n_required} Components for {int(threshold * 100)}%')
        else:
            print(f"⚠️ Threshold {threshold} not reached (max = {max_variance:.2f})")

    # Optional: mark max variance level
    plt.axhline(y=max_variance, color='#D63230', linestyle='--', alpha=0.3,
                label=f'Max Reached: {max_variance:.2f}')

    plt.legend()
    plt.tight_layout()
    plt.show()

    return threshold_components

plot_explained_variance(X_tfidf, max_components=100, thresholds=[0.2, 0.3, 0.35])
2. AIC/BIC (Information Criteria)
Method | AIC / BIC (Adapted from PCA-style residual reconstruction error) |
---|---|
Use Case | LSA (non-probabilistic) - adapted approach |
Metric | Tradeoff between model fit (reconstruction error) and complexity |
How It Works | Penalizes models with more parameters to avoid overfitting |
Pros | Formal criterion for model selection |
Cons | Not designed for SVD or non-generative models, results may be noisy/inconsistent in high-dimensional sparse text |
Outcome in Our Case | No clear minima detected in AIC/BIC curves; insufficient for topic selection here (see plot below) |
def compute_information_criteria(X_tfidf, component_range):
    n_samples, n_features = X_tfidf.shape
    aic_scores, bic_scores = [], []

    X_tfidf_dense = X_tfidf.toarray()

    for n_components in component_range:
        svd = TruncatedSVD(n_components=n_components, random_state=42)
        X_trans = svd.fit_transform(X_tfidf)
        X_approx = np.dot(X_trans, svd.components_)
        rss = np.sum((X_tfidf_dense - X_approx) ** 2)
        k = n_components * (n_features + 1)

        aic = n_samples * np.log(rss / n_samples) + 2 * k
        bic = n_samples * np.log(rss / n_samples) + k * np.log(n_samples)

        aic_scores.append(aic)
        bic_scores.append(bic)

    plt.figure(figsize=(10, 6))
    plt.plot(component_range, aic_scores, label='AIC')
    plt.plot(component_range, bic_scores, label='BIC')
    plt.xlabel('Components')
    plt.ylabel('Score')
    plt.title('AIC and BIC vs Number of Components')
    plt.legend()
    plt.grid(True)
    plt.tight_layout()
    plt.show()

    best_aic = component_range[np.argmin(aic_scores)]
    best_bic = component_range[np.argmin(bic_scores)]

    return best_aic, best_bic

compute_information_criteria(X_tfidf, component_range)
3. Topic Stability (LSA/LDA/NMF-compatible)
Method | Topic Stability (Jaccard similarity across runs) |
---|---|
Use Case | Any model where randomness influences output |
Metric | Average pairwise similarity of top terms per topic across multiple model runs |
Pros | Robust, directly measures semantic consistency |
Cons | Requires multiple model fits; computationally more expensive |
Outcome in Our Case | This metric shows a sharp decline until about [50] components then stabilizes with a slight uptick after 150 components (see outcome of code chunk below) |
4. Topic Quality (Diversity & Exclusivity)
Method | Topic Quality Metrics |
---|---|
Use Case | Any topic model |
Metrics | Diversity: unique words across topics; Exclusivity: words unique to single topics |
Pros | Directly measures topic interpretability and separability |
Cons | Somewhat heuristic, though intuitive and easy to interpret |
Outcome in Our Case | Clear “elbow” observed between [50-75] components for both diversity and exclusivity, indicating high-quality topics (see outcome of code chunk below) |
We’ll use topic stability and topic quality (i.e. exclusivity + diversity) to select the number of LSA components in this particular case.
Why topic stability + topic quality?
In our case, traditional quantitative metrics like explained variance and AIC/BIC did not yield clear-cut or meaningful selection criteria. Instead, we observed visually and quantitatively well-defined optima (see the results of the code chunk below) using:
- Topic Stability: A strong indicator of model consistency across different initializations.
- Topic Quality (Diversity + Exclusivity): Highlighted a point where topics are both distinct and interpretable.
This dual-criterion approach is particularly well-suited to exploratory or unsupervised text analysis, where interpretability and robustness matter more than pure statistical fit.
Higher stability (Jaccard similarity) indicates more consistent topics across different runs, while quality metrics tell you how interpretable those topics are. When selecting the optimal number of components for LSA, we’re looking for:
- Stability (Jaccard similarity) as high as possible - indicating that our topic assignments are consistent across different runs rather than randomly changing.
- Quality metrics (Diversity and Exclusivity) that have reached an elbow point - where adding more components gives diminishing returns.
In our graphs (see results of code chunk below), around 50-75 components seems to be where both conditions start to be met. After this point:
- Stability levels off and even slightly improves
- Quality metrics stop their steep decline and flatten out
This “elbow point” represents a sweet spot where we have enough components to capture meaningful patterns in our data, but not so many that we’re just modeling noise or creating unstable, overlapping topics. A lower component count (e.g. 10) might give higher absolute stability scores but at the cost of poor quality metrics, while very high component counts don’t meaningfully improve any of the metrics.
We choose 70 here.
fine_grained_range = list(range(1, 51)) + list(range(60, 201, 10))  # iteration range: step of 1 from 1 to 50, then step of 10 from 60 to 200
# Extract topics from SVD
def get_topics_from_svd(svd_model, feature_names, n_top_words=10):
    word_indices = np.argsort(svd_model.components_, axis=1)[:, -n_top_words:]
    return np.array(feature_names)[word_indices][:, ::-1]

# 1. Topic Stability
def evaluate_stability(X_tfidf, feature_names, component_range, n_runs=5, n_top_words=10):
    stability_scores = []
    for n_components in component_range:
        runs_topics = []
        for seed in range(n_runs):
            svd = TruncatedSVD(n_components=n_components, random_state=seed)
            svd.fit(X_tfidf)
            topics = get_topics_from_svd(svd, feature_names, n_top_words)
            runs_topics.append(topics)

        sim_scores = []
        for i in range(n_runs):
            for j in range(i + 1, n_runs):
                sim = [
                    len(set(t1) & set(t2)) / len(set(t1) | set(t2))
                    for t1, t2 in zip(runs_topics[i], runs_topics[j])
                ]
                sim_scores.append(np.mean(sim))

        stability_scores.append(np.mean(sim_scores))

    plt.figure(figsize=(10, 6))
    plt.plot(component_range, stability_scores, 'm-o')
    plt.xlabel('Components')
    plt.ylabel('Avg Jaccard Similarity')
    plt.title('Topic Stability Across Runs')
    plt.grid(True)
    plt.tight_layout()
    plt.show()

    return stability_scores
# 2. Topic Quality
def calculate_topic_metrics(X_tfidf, feature_names, component_range, n_top_words=10):
    diversity_scores, exclusivity_scores = [], []

    for n_components in component_range:
        svd = TruncatedSVD(n_components=n_components, random_state=42)
        svd.fit(X_tfidf)
        topics = get_topics_from_svd(svd, feature_names, n_top_words)

        all_words = [word for topic in topics for word in topic]
        diversity = len(set(all_words)) / len(all_words)

        counts = Counter(all_words)
        exclusivity = sum(1 for v in counts.values() if v == 1) / len(counts)

        diversity_scores.append(diversity)
        exclusivity_scores.append(exclusivity)

    plt.figure(figsize=(10, 6))
    plt.plot(component_range, diversity_scores, label='Diversity', color='blue')
    plt.plot(component_range, exclusivity_scores, label='Exclusivity', color='red')
    plt.xlabel('Components')
    plt.ylabel('Score')
    plt.title('Topic Quality Metrics')
    plt.legend()
    plt.grid(True)
    plt.tight_layout()
    plt.show()

    return diversity_scores, exclusivity_scores
# 3. Component Analysis function
def analyze_components(X_tfidf, final_texts, vectorizer, component_range):
    feature_names = vectorizer.get_feature_names_out()
    dictionary = Dictionary(final_texts)

    print("\n1. Evaluating Topic Stability...")
    stability_scores = evaluate_stability(X_tfidf, feature_names, component_range)

    print("\n2. Evaluating Topic Quality...")
    diversity_scores, exclusivity_scores = calculate_topic_metrics(X_tfidf, feature_names, component_range)

    print("\n=== Analysis Complete ===")
    print("Based on the plots, manually set the optimal number of components.")

    return None

# Run the analysis to generate plots
analyze_components(X_tfidf, final_texts, vectorizer, fine_grained_range)

# After viewing the plots, set your chosen number of components here:
selected_components = 70  # Change this value based on your analysis of the plots

print(f"\nSelected number of components: {selected_components}")
1. Evaluating Topic Stability...

2. Evaluating Topic Quality...

=== Analysis Complete ===
Based on the plots, manually set the optimal number of components.

Selected number of components: 70
After the number of components (“topics”) has been selected, we’re ready to run LSA.
svd = TruncatedSVD(n_components=selected_components, random_state=42)
X_lsa = svd.fit_transform(X_tfidf)
lsa_cols = [f'lsa_{i}' for i in range(selected_components)]  # lsa_0, lsa_1, ..., lsa_n-1
X_lsa_df = pd.DataFrame(X_lsa, columns=lsa_cols)
Clustering
For this part, we’ll combine both the matrix obtained with LSA and metadata from the original liar dataframe (the one-hot encoded subject columns, i.e. the columns prefixed with subj_; the column with information about the speaker, i.e. speaker; the date column, i.e. date; and the column that contains information about how true the statement is, i.e. perc_true).
Before we proceed, let’s have a quick look at missing values in the metadata.
selected_columns = ['speaker', 'date', 'perc_true']

# Dynamically add all columns starting with 'subj_'
subj_columns = [col for col in liar.columns if col.startswith('subj_')]

# Combine the initial selection with the dynamically selected columns
final_columns = selected_columns + subj_columns
missing_counts = liar[final_columns].isna().sum()

# Convert to DataFrame for better visualization
missing_summary = missing_counts.reset_index()
missing_summary.columns = ['Column', 'Missing Values']
missing_summary
| Column | Missing Values |
---|---|---|
0 | speaker | 7 |
1 | date | 7 |
2 | perc_true | 0 |
3 | subj_government_regulation | 0 |
4 | subj_polls_and_public_opinion | 0 |
… | … | … |
177 | subj_katrina | 0 |
178 | subj_ohio | 0 |
179 | subj_nbc | 0 |
180 | subj_georgia | 0 |
181 | subj_missouri | 0 |
182 rows × 2 columns
There are a few missing values for speaker and date (and no obvious way to impute them), so we’ll just drop those rows before we proceed with clustering.
# Identify valid (non-missing) rows based on speaker and date
valid_rows = liar["speaker"].notna() & liar["date"].notna()

# Filter liar and X_lsa accordingly
liar_clean = liar[valid_rows].copy()
liar_clean.reset_index(drop=True, inplace=True)
We drop the rows from the LSA matrix that correspond to missing values of speaker and date.
X_lsa_clean_df = X_lsa_df[valid_rows].copy()

# Reset the index for consistency
X_lsa_clean_df.reset_index(drop=True, inplace=True)
Now, we’re ready to pre-process the metadata before going on with clustering:
- the subj_ columns are already one-hot encoded and don’t require further pre-processing
- similarly, perc_true is already scaled (it lies between 0 and 1) and doesn’t require further pre-processing
- for speakers, we only keep the top 20 speakers and replace the values of the others with “Other” (this prevents having data that is too sparse) before one-hot encoding the speaker column
- for the date column, we only keep the year, bin years into discrete intervals (e.g., 1999-2005, 2005-2010, etc.), then encode these intervals as numerical labels for modeling
# 1. Use subj_ columns as-is
subj_cols = liar_clean.filter(regex=r'^subj_').columns
X_subj = liar_clean[subj_cols]  # already binary, no processing needed

# 2. Use perc_true as-is
X_truth = liar_clean[["perc_true"]]

# 3. Process speaker — limit to top N, encode rest as "Other", then one-hot
top_speakers = liar_clean["speaker"].value_counts().nlargest(20).index
liar_clean["speaker_grouped"] = liar_clean["speaker"].where(liar_clean["speaker"].isin(top_speakers), "Other")

speaker_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
speaker_encoded = speaker_encoder.fit_transform(liar_clean[["speaker_grouped"]])
speaker_cols = speaker_encoder.get_feature_names_out(["speaker_grouped"])
df_speaker = pd.DataFrame(speaker_encoded, columns=speaker_cols, index=liar_clean.index).astype(int)

# 4. Process date — bin by year
liar_clean["year"] = pd.to_datetime(liar_clean["date"]).dt.year
df_year = pd.cut(liar_clean["year"], bins=[1999, 2005, 2010, 2015, 2020, 2025], labels=False)
df_year = pd.DataFrame({"year_encoded": df_year.astype(int)}, index=liar_clean.index)

df_meta = pd.concat([X_subj, X_truth, df_speaker, df_year], axis=1)
meta_cols = df_meta.columns.tolist()
We’re now ready for clustering. Since we are combining both numeric and categorical features in our input matrix, our metric of choice is the Gower distance/similarity, which supports mixed data types (see the short illustration after the list below). This rules out some standard algorithms:
- KMeans is incompatible because it assumes Euclidean space and doesn’t handle categorical features.
- DBSCAN works with Gower but fails on this dataset due to variable densities — we tested it but omit the results for clarity (the same is true for other density-based techniques that extend DBSCAN, such as HDBSCAN and OPTICS: our tests, also not shown for concision, reveal the same pattern as with DBSCAN, i.e. clustering the whole data into a single cluster).
Instead, we explore three algorithms:
- KMedoids, which is robust and Gower-compatible
- Spectral Clustering, ideal for complex structures
- Fuzzy C-Means, a soft clustering method that offers probabilistic assignments
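To make the Gower idea concrete, here is a minimal sketch on a toy mixed-type frame (the toy data is invented for illustration):

# Gower averages per-feature dissimilarities: range-normalised absolute
# differences for numeric columns, simple 0/1 mismatches for categoricals
toy = pd.DataFrame({
    "perc_true": [0.1, 0.9, 0.5],          # numeric
    "speaker_grouped": ["a", "a", "b"],    # categorical (object dtype)
})
print(gower_matrix(toy))
# Symmetric 3x3 matrix with zeros on the diagonal; e.g. rows 0 and 1 differ
# fully on perc_true (|0.1 - 0.9| / 0.8 = 1) but match on speaker, so their
# Gower distance is (1 + 0) / 2 = 0.5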
🧩 KMedoids
KMedoids is a partitioning-based clustering method related to KMeans, but it selects actual data points (medoids) as cluster centers rather than computing means. This makes it more interpretable and robust to noise, especially when using arbitrary distance metrics like Gower.
How it works:
- Initialize k medoids randomly
- Assign each point to the nearest medoid using Gower distance
- Swap medoids to minimize overall intra-cluster dissimilarity
- Repeat until convergence
Feature | KMeans | KMedoids |
---|---|---|
Cluster center | Mean | Actual data point (medoid) |
Distance metric | Euclidean | Any (e.g., Gower) |
Noise sensitivity | High | Lower |
Mixed data support | ❌ | ✅ (with Gower) |
Interpretability | Lower | Higher |
✅ Pros & ❌ Cons
Pros | Cons |
---|---|
Works with arbitrary distances (e.g., Gower) | Slower than KMeans on large datasets |
More robust to noise and outliers | Still requires k to be known |
Medoids are real data points (interpretable) | Less scalable than centroid-based methods |
🌈 Spectral Clustering
Spectral Clustering is a graph-based method that uses the eigenstructure of a similarity matrix to uncover clusters. It excels in detecting non-convex, manifold-shaped, or globally-connected clusters — useful in complex datasets combining text and metadata.
How it works:
- Convert Gower distances to similarities (e.g., 1 - Gower)
- Build a graph Laplacian from the similarity matrix
- Compute eigenvectors (spectral embedding)
- Apply clustering (typically KMeans) in the embedded space
Feature | Spectral Clustering | KMedoids |
---|---|---|
Graph-based? | ✅ | ❌ |
Requires k? | ✅ | ✅ |
Handles mixed data? | ✅ (via Gower similarity) | ✅ (via Gower distance) |
Works with complex shapes? | ✅ | Moderate |
✅ Pros & ❌ Cons
Pros | Cons |
---|---|
Captures global data structure | Requires precomputed similarity matrix |
Ideal for complex, non-linear cluster shapes | Memory-intensive on large datasets |
Supports Gower-derived similarity | Requires number of clusters (k ) |
Good for clustering on text + metadata | Sensitive to scale in similarity values |
🌫️ Fuzzy C-Means
Fuzzy C-Means (FCM) is a soft clustering algorithm that assigns partial membership to clusters — useful when cluster boundaries are ambiguous, as with political statements or nuanced text.
⚠️ Limitation: FCM requires Euclidean distance, so we must convert all features to numeric, typically through one-hot encoding for categoricals. This allows the algorithm to run but slightly changes the nature of the data.
How it works:
- Initialize cluster centers
- Compute fuzzy membership scores for each point
- Update cluster centers using weighted averages
- Repeat until convergence
✅ Pros & ❌ Cons
Pros | Cons |
---|---|
Captures ambiguity and overlapping clusters | Requires numeric-only input (not compatible with Gower) |
Soft assignments give richer interpretation | Sensitive to outliers and initialization |
Useful when clusters are not crisply defined | Assumes Euclidean space → categorical encoding may distort data |
🧮 Clustering Comparison
Feature | KMedoids / Spectral | Fuzzy C-Means |
---|---|---|
Assignment type | Hard (1 cluster per point) | Soft (multiple memberships per point) |
Input requirement | Precomputed distances (e.g., Gower for mixed data) | Numeric matrix (e.g., one-hot encoded categoricals) |
Distance metric | Arbitrary (via Gower) | Euclidean (assumes numeric structure) |
Handles ambiguity? | ❌ | ✅ |
Works with Gower? | ✅ (direct Gower compatibility) | ❌ (not directly — requires transformation) |
Handles mixed data? | ✅ (via Gower) | ⚠️ Indirectly — requires numeric encoding |
⚠️ Note: While Fuzzy C-Means doesn’t natively handle categorical variables, it can be applied after transforming mixed data into a fully numeric matrix (e.g., via one-hot encoding). However, doing so introduces Euclidean assumptions that may not reflect true semantic distances, especially with sparse features.
# Function to sanitize text for plotting
def sanitize_text(statement, max_length=100):
    sanitized = statement.replace('$', r'\$').replace('"', r'\"').replace("'", r"\'")
    return sanitized[:max_length] + '...' if len(sanitized) > max_length else sanitized

df_combined = pd.concat([X_lsa_clean_df.reset_index(drop=True), df_meta.reset_index(drop=True)], axis=1)

# --- Set correct dtypes ---
# subj_ columns: binary categorical → set to object for Gower to detect them
for col in subj_cols:
    df_combined[col] = df_combined[col].astype("object")

# perc_true: numeric
df_combined["perc_true"] = df_combined["perc_true"].astype(float)

# speaker columns (already one-hot encoded): also binary categorical
for col in df_speaker.columns:
    df_combined[col] = df_combined[col].astype("object")

# year_encoded: ordinal categorical
df_combined["year_encoded"] = df_combined["year_encoded"].astype("object")
We need to determine the number of clusters for Spectral Clustering, KMedoids, and Fuzzy C-Means. For Spectral and KMedoids, we use the Dunn Index. For Fuzzy C-Means, which produces soft cluster memberships, we use the Xie-Beni Index, a metric specifically designed for fuzzy clustering.
The Dunn Index is more robust than the Silhouette score when working with Gower distance for mixed data types. While both metrics can technically work with precomputed distance matrices, the Dunn Index’s focus on cluster separation rather than compactness makes it better suited for the non-Euclidean nature of Gower distances. Other metrics like Calinski-Harabasz and Davies-Bouldin require Euclidean spaces and cannot be directly applied to Gower distances without transformation.
📊 Dunn Index
What is it?
The Dunn Index is an internal clustering validation metric defined as:
\[ \text{Dunn Index} = \frac{\text{minimum inter-cluster distance}}{\text{maximum intra-cluster distance}} \]
- Encourages tight, well-separated clusters
- Used across different k values to select the best clustering
Use in Practice
- Higher Dunn Index = better clustering
- Used to choose k for Spectral Clustering and KMedoids
Aspect | Value |
---|---|
Best when | Inter-cluster distance is high, intra-cluster tight |
Output range | ≥ 0 (higher is better) |
Pros | Simple, interpretable |
Cons | Sensitive to noise |
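For intuition, here is a tiny worked example on an invented 4-point distance matrix, using the dunn_index helper defined further below:

# Two clusters {0,1} and {2,3}; the distances are invented for illustration
D = np.array([[0.0, 0.2, 0.9, 1.0],
              [0.2, 0.0, 0.8, 0.9],
              [0.9, 0.8, 0.0, 0.3],
              [1.0, 0.9, 0.3, 0.0]])
labels = np.array([0, 0, 1, 1])
# min inter-cluster distance = 0.8, max intra-cluster distance = 0.3
print(dunn_index(D, labels))  # 0.8 / 0.3 ≈ 2.67: tight, well-separated clusters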
🌀 Xie-Beni Index
What is it?
The Xie-Beni Index is a validity metric tailored to fuzzy clustering (like Fuzzy C-Means). It considers both fuzzy membership strength and cluster separation, offering a balance between compactness and separation in soft clustering contexts.
\[ \text{Xie-Beni Index} = \frac{\sum_{i=1}^{c}\sum_{j=1}^{N} u_{ij}^{m}\,\lVert x_j - v_i \rVert^{2}}{N \cdot \min_{i \neq k} \lVert v_i - v_k \rVert^{2}} \]

i.e. the total membership-weighted intra-cluster variance divided by the minimum squared distance between cluster centers, scaled by the number of points \(N\) (this is the form implemented in the xie_beni_index function below).
Why not Dunn or other indices?
Unlike Spectral or KMedoids, Fuzzy C-Means doesn’t assign each point to a single cluster. Instead, each point has degrees of membership across all clusters. Traditional validation metrics like Dunn, Silhouette, Calinski-Harabasz, or Davies-Bouldin are designed for hard (crisp) clustering, where every point belongs to exactly one cluster. Applying these to fuzzy results would require forced binarization (e.g. via argmax), which discards the soft nature of the clustering and leads to misleading evaluations.
The Xie-Beni Index, by contrast:
- Accounts for soft assignments directly (using membership weights)
- Evaluates the balance between compactness and separation in the fuzzy space
- Naturally penalizes overlapping clusters and under-separation
Use in Practice
- Lower Xie-Beni Index = better clustering
- Used to choose c (number of clusters) for Fuzzy C-Means
Aspect | Value |
---|---|
Best when | Clusters are compact and well-separated with low overlap |
Output range | ≥ 0 (lower is better) |
Pros | Designed for fuzzy clustering, interprets soft labels |
Cons | Sensitive to outliers and high-dimensional noise |
def dunn_index(distance_matrix, labels):
    unique_clusters = np.unique(labels)

    intra_dists = []
    for i in unique_clusters:
        cluster_indices = np.where(labels == i)[0]
        if len(cluster_indices) > 1:
            intra_cluster_dist = np.max(distance_matrix[np.ix_(cluster_indices, cluster_indices)])
        else:
            intra_cluster_dist = 0
        intra_dists.append(intra_cluster_dist)

    inter_dists = []
    for i in unique_clusters:
        for j in unique_clusters:
            if i < j:
                idx_i = np.where(labels == i)[0]
                idx_j = np.where(labels == j)[0]
                if len(idx_i) > 0 and len(idx_j) > 0:
                    inter_cluster_dist = np.min(distance_matrix[np.ix_(idx_i, idx_j)])
                    inter_dists.append(inter_cluster_dist)

    if not intra_dists or not inter_dists or np.max(intra_dists) == 0:
        return np.nan

    return np.min(inter_dists) / np.max(intra_dists)
def dunn_evaluation(distance_matrix, mode="kmedoids", values=None):
    scores = []
    labels_all = []

    if values is None:
        values = range(2, 11)

    for v in tqdm(values, desc=f"Dunn Index ({mode})"):
        try:
            match mode:
                case "kmedoids":
                    model = KMedoids(n_clusters=v, metric='precomputed', random_state=42)
                    model.fit(distance_matrix)
                    labels = model.labels_
                case "spectral":
                    sim = 1 - distance_matrix
                    model = SpectralClustering(n_clusters=v, affinity='precomputed', random_state=42).fit(sim)
                    labels = model.labels_
                case _:
                    raise ValueError(f"Unsupported mode: {mode}")

            score = dunn_index(distance_matrix, labels)
            scores.append(score)
            labels_all.append(labels)
        except Exception as e:
            print(f"{mode} {v} failed: {e}")
            scores.append(np.nan)
            labels_all.append(None)

    return values, scores, labels_all
def plot_index(values, scores, title, x_label="k", index="Dunn"):
    df = pd.DataFrame({'value': values, 'dunn': scores}).dropna()
    return (
        ggplot(df, aes(x='value', y='dunn')) +
        geom_line(color='#1C77C3') +
        geom_point(color='#1C77C3', size=3) +
        ggtitle(title) +
        xlab(x_label) + ylab(index + " Index") +
        theme_minimal()
    )
def xie_beni_index(X, u, centers, m=2):
    n_clusters = centers.shape[0]
    N = X.shape[0]

    um = u ** m
    dist = np.zeros((n_clusters, N))
    for k in range(n_clusters):
        dist[k] = np.linalg.norm(X - centers[k], axis=1) ** 2

    compactness = np.sum(um * dist)

    min_dist = np.min([
        np.linalg.norm(centers[i] - centers[j]) ** 2
        for i in range(n_clusters) for j in range(n_clusters) if i != j
    ])
    return compactness / (N * min_dist)
def xie_beni_evaluation(X, values=range(2, 11)):
    scores = []
    labels_all = []

    for c in tqdm(values, desc="Xie-Beni Index (Fuzzy C-Means)"):
        try:
            cntr, u, _, _, _, _, _ = cmeans(
                X.T, c, m=2.0, error=0.005, maxiter=1000, init=None
            )
            xb = xie_beni_index(X, u, cntr)
            scores.append(xb)
            labels_all.append(np.argmax(u, axis=0))  # hard labels
        except Exception as e:
            print(f"Fuzzy c={c} failed: {e}")
            scores.append(np.nan)
            labels_all.append(None)

    return values, scores, labels_all
# Computing the Gower distance matrix
gower_dist = gower_matrix(df_combined)

# KMedoids
ks_k, dunn_k, labels_k = dunn_evaluation(gower_dist, mode="kmedoids", values=range(2, 11))
plot_kmedoid = plot_index(ks_k, dunn_k, title="Dunn Index — KMedoids", x_label="k", index="Dunn")

# Spectral
ks_s, dunn_s, labels_s = dunn_evaluation(gower_dist, mode="spectral", values=range(2, 11))
plot_spectral = plot_index(ks_s, dunn_s, title="Dunn Index — Spectral Clustering", x_label="k", index="Dunn")
# Fuzzy
fuzzy_input = df_combined.copy()

for col in fuzzy_input.columns:
    if fuzzy_input[col].dtype == 'object':
        # Try to convert to numeric
        try:
            fuzzy_input[col] = pd.to_numeric(fuzzy_input[col], errors='raise')
        except:
            # If that fails, use factorize (for true categorical text)
            fuzzy_input[col] = pd.factorize(fuzzy_input[col])[0]

# At this point, all columns should be numeric
fuzzy_input = fuzzy_input.fillna(0).astype(np.float64)
X_fuzzy = fuzzy_input.values   # samples in rows
c_means_input = X_fuzzy.T      # features in rows, as skfuzzy's cmeans expects

# Note: xie_beni_evaluation transposes internally before calling cmeans, so it
# takes the samples-in-rows matrix (X_fuzzy), not the transposed c_means_input
ks_f, xie_beni_f, labels_f = xie_beni_evaluation(X_fuzzy, values=range(2, 11))
plot_fuzzy = plot_index(ks_f, xie_beni_f, title="Xie-Beni Index — Fuzzy C-Means Clustering", x_label="c", index="Xie-Beni")
Dunn Index (kmedoids): 100%|██████████| 9/9 [00:04<00:00, 2.06it/s]
Dunn Index (spectral): 100%|██████████| 9/9 [00:45<00:00, 5.09s/it]
Xie-Beni Index (Fuzzy C-Means): 100%|██████████| 9/9 [00:04<00:00, 2.09it/s]
g = gggrid([
    plot_kmedoid,
    plot_spectral,
    plot_fuzzy
], ncol=3)
g += ggsize(1400, 800)
g
Looking at these plots, each clustering method shows different patterns for determining the optimal number of clusters:
KMedoids (Dunn Index): Shows the highest value at k=2, then drops significantly and gradually increases until k=6 before flattening. While k=2 has the highest Dunn Index, this could be creating overly broad clusters. The “elbow” appears around k=5, which might represent a better balance between cluster separation and meaningful groupings.
Spectral Clustering (Dunn Index): Shows the highest value at k=2, with another peak at k=4-5, then a drop at k=6, followed by another stable period at k=7-9. The interesting feature here is the local maximum at k=4-5, suggesting these might be meaningful cluster counts.
Fuzzy C-Means (Xie-Beni Index): For the Xie-Beni Index, lower values indicate better clustering. There’s a significant spike at c=5, which should be avoided. The lowest values appear at c=2, c=3, and c=7, with c=2-3 showing the absolute minimum.
To select the optimal number of clusters, we should compare across methods and look for agreement between them. \(k=7\) appears to be a reasonable choice across all three plots (rising in KMedoids, local peak in Spectral, local minimum in Fuzzy C-Means), while avoiding the extreme values that might represent either too few or too many clusters.
# Try a fixed number of clusters
gower_sim = 1 - gower_dist
n_clusters = 7

# --- Spectral Clustering ---
spec_7clusters = SpectralClustering(n_clusters=n_clusters, affinity='precomputed', random_state=42)
labels_spec_7clusters = spec_7clusters.fit_predict(gower_sim)

# --- KMedoids ---
kmedoids_7clusters = KMedoids(n_clusters=n_clusters, metric='precomputed', random_state=42, max_iter=300)

# Fit the model
kmedoids_7clusters.fit(gower_dist)

# Get cluster labels
labels_7clusters = kmedoids_7clusters.labels_

# Run Fuzzy C-means
cntr_7clust, u_7clust, _, _, _, _, _ = cmeans(
    c_means_input, c=n_clusters, m=2.0, error=0.005, maxiter=1000, init=None
)

labels_fuzzy_7clust = np.argmax(u_7clust, axis=0)
Since our clustering is performed on the high-dimensional matrix directly (a 5733 by 5733 Gower distance matrix for Spectral Clustering and KMedoids, and the original 5733 by 272 matrix for Fuzzy C-Means!), we need to reduce the dimensionality of our data with UMAP to be able to visualise it.
umap_model_2d = umap.UMAP(n_components=2, metric='precomputed', random_state=42)
umap_embedding_2d = umap_model_2d.fit_transform(gower_dist)

umap_model_fuzzy = umap.UMAP(n_components=2, metric='euclidean', random_state=42)
umap_embedding_fuzzy = umap_model_fuzzy.fit_transform(X_fuzzy)
# Plotting helper
def plot_clusters(xy, labels, title, meta):
    df_plot = pd.DataFrame({
        "x": xy[:, 0],
        "y": xy[:, 1],
        "cluster": labels.astype(str),  # 👈 Cast to string
        "statement": meta["statement"],
        "perc_true": meta["perc_true"],
        "speaker": meta["speaker"],
        "date": meta["date"]
    })

    fig = px.scatter(df_plot, x='x', y='y', color='cluster',
                     hover_data=["statement", "perc_true", "speaker", "date"],
                     opacity=0.7, title=title,
                     color_discrete_sequence=px.colors.qualitative.T10)  # optional: better color palette
    fig.show()
X_combined_index = liar_clean.index

# Plot Spectral
plot_clusters(umap_embedding_2d, labels_spec_7clusters, "Spectral Clustering (n=7)", liar_clean.loc[X_combined_index])

# Plot KMedoid
plot_clusters(umap_embedding_2d, labels_7clusters, "KMedoid Clustering (n=7)", liar_clean.loc[X_combined_index])

# Plot Fuzzy C-Means
plot_clusters(umap_embedding_fuzzy, labels_fuzzy_7clust, "Fuzzy C-Means Clustering (n=7)", liar_clean.loc[X_combined_index])
What does this clustering tell us?
Looking at our clustering results with n=7 across the three different methods (Fuzzy C-Means, KMedoid, and Spectral Clustering), alongside differences in data preprocessing, we can draw several important insights regarding our dataset’s underlying structure and the suitability of this cluster count.
Fuzzy C-Means Clustering (n=7 on raw mixed data)
Despite specifying 7 clusters, the visualization reveals only two distinct clusters (labeled 3 and 4), with cluster 4 showing some internal structure. This outcome suggests:
- The algorithm has effectively collapsed the data into a binary structure, assigning almost all points to just two major groups.
- This may indicate that our original feature space contains a dominant binary signal, possibly due to a few influential features.
- Membership values are largely concentrated, which implies that the fuzzy clustering doesn’t detect 7 meaningful divisions in this raw representation.
This aligns with our validation metrics (Dunn Index and Xie-Beni), both of which also pointed to 2–3 clusters as potentially optimal.
KMedoid Clustering (n=7 on Gower distance matrix)
This method produced well-separated clusters across the 2D UMAP projection, with all 7 clusters (0–6) clearly represented:
- Some clusters (e.g., 0, 1, 2) are distinctly separated, particularly on the left side of the visualization.
- Others (3, 4, 6) show more overlap in the center-right region, suggesting either fuzzier boundaries or latent substructure.
- Clusters generally appear coherent and well-shaped, indicating meaningful groupings in the transformed space.
Using the Gower distance matrix here proves advantageous, as it allows for balanced handling of both categorical and numeric features in our dataset.
Spectral Clustering (n=7 on Gower distance matrix)
Spectral clustering yielded a similar structure to KMedoid, supporting the validity of the Gower representation:
- All 7 clusters are visible, with cluster 3 forming a distinct band.
- While the overall shapes differ slightly from KMedoid (especially in the right half), the core structural patterns are consistent.
- There is some overlap between clusters, but also clear regions of separation, especially on the left.
Comparison Across Methods
- Consistency:
- KMedoid and Spectral clustering show similar global patterns, validating the structure uncovered via the Gower matrix.
- Fuzzy C-Means deviates significantly, emphasizing a binary grouping — this difference stems from working directly on the raw feature space.
- Data Representation Matters:
- Fuzzy C-Means is sensitive to dominant features in the raw data and may be overpowered by a few strong signals.
- The Gower distance matrix enables more nuanced group detection by normalizing contributions across mixed feature types.
- Cluster Separability and Interpretability:
- Fuzzy C-Means suggests a high-level binary division might be most natural.
- KMedoid and Spectral offer finer-grained subgroups, useful for in-depth exploration or downstream tasks like classification or profiling.
Conclusions
- If our goal is high-level categorization (e.g., separating broad truth/falsity groupings or political leanings):
- Fuzzy C-Means with 2–3 clusters might be optimal.
- If our goal is detailed segmentation of our dataset that respects the complex interplay of text and metadata:
- KMedoid or Spectral clustering with n=7, using the Gower matrix, provides better differentiation and structure.
Digging a bit further into Fuzzy C-Means
A peculiarity of Fuzzy C-Means is that it doesn’t quite assign a definitive cluster label to each data point. Instead, it produces a membership matrix \(U\), where each entry \(u_{ij}\) indicates the degree to which point \(j\) belongs to cluster \(i\). These degrees sum to 1 across clusters for each point, reflecting a soft assignment.
The typical practice of assigning each data point to a single cluster is done a posteriori by selecting the cluster with the highest membership score — mathematically, by taking the argmax over the membership matrix. This effectively converts the soft assignment into a hard label, creating a crisp partition similar to what you’d get from KMeans or KMedoids. While this makes comparison across clustering methods easier, it discards the soft assignment information — and in doing so, may obscure meaningful ambiguity or uncertainty in the data.
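A small invented example of what this hardening step does:

# Membership matrix U: rows = clusters, columns = points (columns sum to 1)
U = np.array([[0.70, 0.20, 0.40],
              [0.20, 0.50, 0.30],
              [0.10, 0.30, 0.30]])
print(U.argmax(axis=0))  # [0 1 0]
# Point 3 is barely in cluster 0 (0.40 vs 0.30/0.30), but argmax hides that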
One way to better understand this ambiguity is to visualize the maximum membership strength each point has. A value near 1 indicates a confident assignment, while lower values suggest that the point lies closer to a boundary between clusters, or shares affinities with multiple groupings.
Below is a UMAP projection of the data used in the fuzzy clustering step, where points are colored by their maximum membership value:
# Max membership value per point (degree of "confidence")
max_membership = u_7clust.max(axis=0)

# Create a DataFrame for plotting
df_membership_plot = pd.DataFrame({
    "x": umap_embedding_fuzzy[:, 0],
    "y": umap_embedding_fuzzy[:, 1],
    "max_membership": max_membership
})

# Plot: color by maximum membership value
fig = px.scatter(df_membership_plot, x='x', y='y',
                 color='max_membership',
                 color_continuous_scale='viridis',
                 title="Fuzzy C-Means: Maximum Membership Strength (UMAP projection)",
                 opacity=0.75)
fig.show()
This plot gives a more nuanced view of the Fuzzy C-Means clustering result: regions with high membership indicate strong, unambiguous cluster identity, while more diffuse or mixed-color areas highlight zones of uncertainty — where Fuzzy C-Means acknowledges that the data doesn’t neatly separate.
Key Observations

Cluster Layout
- A large, elongated central mass stretches horizontally, spanning from roughly \(x = -5\) to \(x = 7\)
- Several more compact clusters are visible on the far right (\(x > 8\)), some of which show more distinct structure
- A scattering of isolated points and micro-clusters appears at the edges

Membership Strength Distribution
- The color gradient ranges from approximately 0.143 (deep purple) to 0.144 (yellow-green)
- This extremely narrow range of maximum membership values suggests that nearly all points have roughly equal partial membership across all 7 clusters
- Points in the rightmost clusters tend to have slightly higher confidence (brighter coloring), whereas the central mass is uniformly ambiguous (darker tones)
Interpretation and Implications
High Ambiguity in Cluster Assignments
With 7 completely overlapping clusters, memberships are uniform at 1/7 ≈ 0.143, which matches the observed range. This strongly suggests that most points are not confidently assigned to any single cluster — an indicator of high overlap and fuzzy boundaries.

Possible Overclustering
The absence of high membership values indicates that 7 clusters may be too many for the structure present in the data. Previous validation metrics like the Xie-Beni and Dunn index pointed to 2–5 clusters as more optimal. The current setting may be splitting natural groupings unnecessarily, resulting in soft, indistinct divisions.

Heterogeneous Data Space
The combination of textual features (via LSA) and metadata likely produces a high-dimensional, mixed-type space. In such settings, fuzzy clustering can struggle to identify compact, well-separated groups, especially when dominant features dilute signal from weaker but meaningful ones.

Localized Certainty in Some Regions
The right-side clusters in the UMAP projection show slightly higher membership values, suggesting that some portions of the data do form clearer, more self-contained clusters — even if the overall structure remains diffuse.
Conclusion
The Fuzzy C-Means result, viewed through the lens of maximum membership strength, reveals that the clustering model sees the dataset as highly ambiguous — with few points belonging clearly to a single group. This, combined with visual cues and clustering metrics, suggests that:
- Fewer clusters (e.g. 2–5) might better reflect the natural structure
- Fuzzy C-Means is sensitive to the representation used; preprocessing and distance choice matter
- For this dataset, methods that better accommodate mixed data types (like KMedoids or Spectral Clustering with Gower distance) may offer sharper partitions and more interpretable structure
This visualization offers a valuable diagnostic tool: it doesn’t just show where points fall in space, but how certain the algorithm is about their group identity — and that uncertainty speaks volumes.
🧠 Consensus Clustering: What is it and why might we need it?
Instead of relying on just one clustering algorithm, we can go a step further: what if we combined the results from all the methods we’ve tested — like KMedoid, Spectral Clustering, and Fuzzy C-Means — into a unified solution?
This is the idea behind consensus clustering. It’s particularly useful when clustering outputs are noisy, diverging, or hard to interpret. A consensus can consolidate differing outputs, reduce variance, and often improve robustness by integrating complementary perspectives on the data.
When multiple clustering algorithms yield divergent results, consensus clustering offers a principled way to combine these perspectives into a single, unified clustering. Here, we explore three common methods:
1. 🧮 Hard Voting (Majority Rule)
How it Works: Each clustering method “votes” on the label for each data point. The most frequent label across methods is chosen as the consensus assignment.
- If there is a tie, it can be broken randomly or resolved by a priority scheme.
Example:
Data Point | KMedoid | Spectral | Fuzzy (argmax) | Consensus Label |
---|---|---|---|---|
A | 0 | 1 | 0 | 0 (2 votes) |
B | 1 | 1 | 2 | 1 (2 votes) |
C | 2 | 2 | 2 | 2 (3 votes) |
✅ Pros vs ❌ Cons
Pros | Cons |
---|---|
Simple, fast, and interpretable | Ignores uncertainty and cluster proximity |
Doesn’t require distance metrics | Cannot handle ambiguity or soft clustering |
Useful for clearly separable data | Fails if all methods strongly disagree (e.g. all labels different) |
2. 🔁 Reclustering on One-Hot Encoded Assignments
How it Works:
- Each method’s cluster assignments are converted into one-hot encoded vectors.
- These vectors are concatenated into a new feature matrix (per data point).
- A clustering algorithm (e.g., Agglomerative Clustering) is then run on this matrix to find a consensus.
Example:
Say we have 3 clustering methods (each assigning labels for 3 clusters):
Data Point | KMedoid One-Hot | Spectral One-Hot | Fuzzy Argmax One-Hot | Combined Vector |
---|---|---|---|---|
A | [1, 0, 0] | [0, 1, 0] | [1, 0, 0] | [1, 0, 0, 0, 1, 0, 1, 0, 0] |
B | [0, 1, 0] | [0, 1, 0] | [0, 0, 1] | [0, 1, 0, 0, 1, 0, 0, 0, 1] |
C | [0, 0, 1] | [1, 0, 0] | [0, 0, 1] | [0, 0, 1, 1, 0, 0, 0, 0, 1] |
This matrix becomes the input to a new clustering.
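As a sanity check, the combined vectors in the table above can be reproduced with sklearn's OneHotEncoder; a minimal sketch on this hypothetical three-point example (not the notebook data):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical labels for points A, B, C: columns are KMedoid, Spectral, Fuzzy argmax
labels = np.array([[0, 1, 0],
                   [1, 1, 2],
                   [2, 0, 2]])

# Fix the categories so each method contributes exactly 3 columns
enc = OneHotEncoder(sparse_output=False, categories=[[0, 1, 2]] * 3)
print(enc.fit_transform(labels).astype(int))   # rows match the "Combined Vector" column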
✅ Pros vs ❌ Cons
Pros | Cons |
---|---|
Captures voting patterns across methods | Still treats each assignment as binary (no ambiguity) |
Works well when clusters partially align | Sensitive to label encoding inconsistencies across methods |
Doesn’t require original data or distance metrics | High-dimensional if many methods or clusters are used |
3. 🌈 Reclustering on One-Hot + Fuzzy Memberships
How it Works: This extends the one-hot strategy by appending the soft cluster membership scores from Fuzzy C-Means to the one-hot encoded vectors. This allows the consensus clustering to also consider the confidence levels in Fuzzy assignments.
Example:
Data Point | Combined One-Hot (from above) | Fuzzy Memberships | Final Vector |
---|---|---|---|
A | [1, 0, 0, 0, 1, 0, 1, 0, 0] | [0.70, 0.20, 0.10] | [1, 0, 0, 0, 1, 0, 1, 0, 0, 0.70, 0.20, 0.10] |
B | [0, 1, 0, 0, 1, 0, 0, 0, 1] | [0.35, 0.33, 0.32] | [0, 1, 0, 0, 1, 0, 0, 0, 1, 0.35, 0.33, 0.32] |
C | [0, 0, 1, 1, 0, 0, 0, 0, 1] | [0.05, 0.15, 0.80] | [0, 0, 1, 1, 0, 0, 0, 0, 1, 0.05, 0.15, 0.80] |
This approach creates a richer, more expressive representation per point.
✅ Pros vs ❌ Cons
Pros | Cons |
---|---|
Incorporates soft assignments → better reflects ambiguity | Slightly more complex to implement |
Richer, high-dimensional representation with more nuance | Sensitive to differences in scale (needs normalization) |
Can differentiate between ambiguous and confident clusterings | May require more sophisticated reclustering algorithms |
🧭 Which to Use?
Use Case | Recommended Strategy |
---|---|
You want fast, simple consensus | Hard Voting |
You want a method that reflects patterns across methods | One-Hot Reclustering |
You want to incorporate ambiguity and confidence | One-Hot + Fuzzy Memberships |
First approach: Voting (majority rule)
# Stack labels from different methods (all should be of shape (n_samples,))
all_labels = np.vstack([
    labels_spec_7clusters,   # Spectral Clustering
    labels_7clusters,        # KMedoids
    labels_fuzzy_7clust      # Fuzzy C-Means
])

# Compute the mode along axis=0 (i.e., majority vote for each point)
voted_labels, _ = mode(all_labels, axis=0, keepdims=False)

# Visualize using the same UMAP embedding (e.g., the Gower-based one)
plot_clusters(umap_embedding_2d, voted_labels, "Consensus Clustering (Voting Majority)", liar_clean.loc[X_combined_index])
This first method predictably fails: cluster IDs are arbitrary, so label 0 from one method need not correspond to label 0 from another, and the three labelings are not aligned.
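One way to make voting meaningful would be to align each method's labels to a reference clustering first, e.g. with the Hungarian algorithm; a minimal sketch (not run in this notebook):

from scipy.optimize import linear_sum_assignment
from sklearn.metrics import confusion_matrix

def align_labels(reference, labels):
    """Relabel `labels` so its cluster IDs overlap as much as possible with `reference`."""
    cm = confusion_matrix(reference, labels)
    row_ind, col_ind = linear_sum_assignment(-cm)   # maximise matched overlap
    mapping = {col: row for row, col in zip(row_ind, col_ind)}
    return np.array([mapping[l] for l in labels])

# e.g. align Spectral and Fuzzy labels to the KMedoids labelling before voting
aligned_spec = align_labels(labels_7clusters, labels_spec_7clusters)
aligned_fuzzy = align_labels(labels_7clusters, labels_fuzzy_7clust)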
Second approach: One-hot reclustering
# Combine cluster labels as categorical features
labels_matrix = np.vstack([
    labels_spec_7clusters,
    labels_7clusters,
    labels_fuzzy_7clust
]).T   # shape: (n_samples, 3)

# One-hot encode the cluster labels for each method
encoder = OneHotEncoder(sparse_output=False)
labels_onehot = encoder.fit_transform(labels_matrix)

# Perform Agglomerative Clustering on this feature matrix
agg_cluster = AgglomerativeClustering(n_clusters=n_clusters, metric='euclidean', linkage='ward')
labels_consensus_agg = agg_cluster.fit_predict(labels_onehot)

# Visualize
plot_clusters(umap_embedding_2d, labels_consensus_agg, "Consensus Clustering (One-Hot Reclustering approach)", liar_clean.loc[X_combined_index])
The clusters are more defined with this approach but still quite noisy.
Third approach: One-hot + Fuzzy Memberships reclustering
# One-hot encode KMedoid and Spectral labels
labels_matrix = np.vstack([
    labels_spec_7clusters,
    labels_7clusters
]).T   # shape (n_samples, 2)

encoder = OneHotEncoder(sparse_output=False)
labels_onehot = encoder.fit_transform(labels_matrix)   # shape (n_samples, 14) if 7 clusters each

# Concatenate fuzzy membership matrix (transpose to shape (n_samples, 7))
combined_features = np.hstack([
    labels_onehot,   # Hard cluster assignments
    u_7clust.T       # Soft assignments from FCM
])

# Agglomerative reclustering on combined cluster representations
agg_cluster_soft = AgglomerativeClustering(n_clusters=n_clusters, metric='euclidean', linkage='ward')
labels_consensus_soft = agg_cluster_soft.fit_predict(combined_features)

# Visualize
plot_clusters(umap_embedding_2d, labels_consensus_soft, "Consensus Clustering (One-hot + Fuzzy Memberships reclustering approach)", liar_clean.loc[X_combined_index])
Quite predictably, this method performs best out of the consensus methods and the clusters are much more distinct.
Anomaly detection
After performing clustering, we take our analysis a step further by identifying anomalies, i.e. statements that don’t fit in with the rest and stand out the most.
Since anomaly detection often involves uncertainty and noisy signals, we’ll apply three different algorithms to get multiple perspectives:
- Isolation Forest
- Local Outlier Factor (LOF)
- One-Class SVM
Just like with clustering, the results may differ. So we’ll combine them via consensus voting to identify strongly agreed-upon anomalies.
🌲 Isolation Forest
Isolation Forest works by isolating points in the data through recursive partitioning — anomalies are isolated quickly and thus have shorter average path lengths.
Key Parameter:
- contamination=0.05: Specifies the expected proportion of outliers in the data (5%). This is important — the algorithm will “force” itself to label that fraction of points as anomalous.
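To see this in practice, one can inspect the continuous anomaly scores directly; a minimal sketch, reusing the iso_forest model and X_isolation matrix fitted in the code further down:

# Sketch: contamination simply sets the threshold on the anomaly score
scores = iso_forest.decision_function(X_isolation)   # negative => anomaly
print((scores < 0).mean())   # ≈ 0.05 by construction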
🧭 Local Outlier Factor (LOF)
LOF measures local density deviation. Points that have significantly lower density compared to their neighbors are considered anomalies.
Key Parameters:
- n_neighbors=20: Number of neighbors to use when estimating local density. Larger values make the model less sensitive to local fluctuations.
- contamination=0.05: Again, this sets the expected proportion of anomalies.
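For intuition, the raw LOF scores can be read off the fitted model; a minimal sketch, reusing the lof model fitted in the code further down:

# Sketch: LOF scores close to 1 are inliers; much larger values are outliers
lof_scores = -lof.negative_outlier_factor_
print(lof_scores[:5])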
🧠 One-Class SVM (OC-SVM)
What is it?
One-Class SVM is a variant of the traditional SVM algorithm, adapted for unsupervised anomaly detection.
While standard SVMs are trained on labeled data to separate known classes (e.g. cat vs. dog), One-Class SVM takes only unlabeled data (assumed to be mostly “normal”) and tries to learn the boundary of this normal region.
✅ This makes it unsupervised — it requires no prior labeling of anomalies during training.
Imagine trying to wrap a tight boundary around all the “normal” data points. Any point falling outside this boundary is flagged as anomalous. This is especially helpful when anomalies are rare or not well-defined in advance.
How It Works:
- Learns a decision function that best encloses the data in a high-dimensional space.
- Uses the RBF (Radial Basis Function) kernel to capture non-linear boundaries.
- Points outside the learned hypersphere are labeled as anomalies.
Key Parameters:
- kernel='rbf': Allows the model to find curved, non-linear boundaries in feature space.
- nu=0.05: An upper bound on the fraction of anomalies (outliers). Also acts as a regularizer.
Strengths
- Good for tight, compact clusters of normal data.
- Effective when anomalies are far away from the main distribution.
Weaknesses
- Can struggle with sparse or noisy data.
- Sensitive to feature scaling — data needs to be well-preprocessed.
- Not great if normal data is spread out or multi-modal.
🆚 One-Class SVM vs. Isolation Forest vs. LOF
Feature | One-Class SVM | Isolation Forest | Local Outlier Factor (LOF) |
---|---|---|---|
Supervision | Unsupervised | Unsupervised | Unsupervised |
Assumption | Most data is “normal” | Anomalies are easier to isolate | Anomalies have lower local density |
Boundary Type | Tight enclosing boundary (global) | Random partitions (tree-based) | Local density comparison |
Sensitivity | Global (sensitive to scaling) | Robust to high-dimensional noise | Local context-dependent |
Interpretability | Moderate (abstract boundary) | High (tree-paths & feature splits) | Moderate (density comparison) |
Best Use Case | Small-to-medium data; subtle anomalies | High-dimensional or noisy datasets | When local density variation is key |
🌳 Isolation Forest
# Step 1: Prepare data
df_isolation_ready = df_combined.copy()

# Convert categorical columns to integers
for col in df_isolation_ready.select_dtypes(include="object").columns:
    df_isolation_ready[col] = df_isolation_ready[col].astype(int)

# Safe float conversion
X_isolation = df_isolation_ready.values.astype(float)

# Step 2: Dimensionality Reduction (UMAP)
umap_model = umap.UMAP(n_components=2, random_state=42)
X_umap = umap_model.fit_transform(X_isolation)

# Step 3: Fit Isolation Forest
iso_forest = IsolationForest(contamination=0.05, random_state=42)
y_pred_isolation = iso_forest.fit_predict(X_isolation)

# Step 4: Create a DataFrame for plotting
df_plot = pd.DataFrame({
    "x": X_umap[:, 0],
    "y": X_umap[:, 1],
    "anomaly": y_pred_isolation
}, index=liar_clean.index)

df_plot["statement"] = liar_clean["statement"]
df_plot["perc_true"] = liar_clean["perc_true"]
df_plot["speaker"] = liar_clean["speaker"]
df_plot["date"] = liar_clean["date"]

# Step 5: Convert anomaly labels to string for discrete coloring
df_plot["anomaly"] = df_plot["anomaly"].map({1: "normal", -1: "anomaly"})

# Step 6: Interactive plot
fig = px.scatter(
    df_plot,
    x='x',
    y='y',
    color='anomaly',
    hover_data=["statement", "perc_true", "speaker", "date"],
    opacity=0.7,
    title="Isolation Forest Anomaly Detection (UMAP projection)",
    color_discrete_sequence=px.colors.qualitative.T10
)
fig.show()
🎯 Local Outlier Factor
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
y_pred_lof = lof.fit_predict(X_isolation)

# Step 4: Create a DataFrame for plotting
df_plot = pd.DataFrame({
    "x": X_umap[:, 0],
    "y": X_umap[:, 1],
    "anomaly": y_pred_lof
}, index=liar_clean.index)

df_plot["statement"] = liar_clean["statement"]
df_plot["perc_true"] = liar_clean["perc_true"]
df_plot["speaker"] = liar_clean["speaker"]
df_plot["date"] = liar_clean["date"]

# Step 5: Convert anomaly labels to string for discrete coloring
df_plot["anomaly"] = df_plot["anomaly"].map({1: "normal", -1: "anomaly"})

# Step 6: Interactive plot
fig = px.scatter(
    df_plot,
    x='x',
    y='y',
    color='anomaly',
    hover_data=["statement", "perc_true", "speaker", "date"],
    opacity=0.7,
    title="LOF Anomaly Detection (UMAP projection)",
    color_discrete_sequence=px.colors.qualitative.T10
)
fig.show()
🧠 OneClassSVM
svm = OneClassSVM(kernel='rbf', gamma='scale', nu=0.05)   # nu = approx fraction of anomalies
y_pred_svm = svm.fit_predict(X_isolation)

df_plot = pd.DataFrame({
    "x": X_umap[:, 0],
    "y": X_umap[:, 1],
    "anomaly": y_pred_svm
}, index=liar_clean.index)

df_plot["statement"] = liar_clean["statement"]
df_plot["perc_true"] = liar_clean["perc_true"]
df_plot["speaker"] = liar_clean["speaker"]
df_plot["date"] = liar_clean["date"]

# Step 5: Convert anomaly labels to string for discrete coloring
df_plot["anomaly"] = df_plot["anomaly"].map({1: "normal", -1: "anomaly"})

# Step 6: Interactive plot
fig = px.scatter(
    df_plot,
    x='x',
    y='y',
    color='anomaly',
    hover_data=["statement", "perc_true", "speaker", "date"],
    opacity=0.7,
    title="OneClassSVM Anomaly Detection (UMAP projection)",
    color_discrete_sequence=px.colors.qualitative.T10
)
fig.show()
Overlaying the results of Isolation Forest and OneClassSVM
We overlay the results of two methods (Isolation Forest and One-Class SVM) to explore areas of agreement or disagreement:
Color | Meaning |
---|---|
Red | Detected by both |
Orange | Isolation Forest only |
Purple | One-Class SVM only |
Light gray | Neither |
# Step 2: Create overlay DataFrame
overlay_df = pd.DataFrame({
    "x": X_umap[:, 0],
    "y": X_umap[:, 1],
    "isolation": y_pred_isolation,   # Isolation Forest predictions
    "ocsvm": y_pred_svm              # One-Class SVM predictions
}, index=liar_clean.index)

# Combine the two label sets (-1 = anomaly) into an agreement status
overlay_df["status"] = overlay_df.apply(
    lambda row: "both anomaly" if row["isolation"] == -1 and row["ocsvm"] == -1
    else "isolation forest only" if row["isolation"] == -1
    else "one class SVM only" if row["ocsvm"] == -1
    else "normal",
    axis=1
)

# Add metadata
overlay_df["statement"] = liar_clean["statement"]
overlay_df["perc_true"] = liar_clean["perc_true"]
overlay_df["speaker"] = liar_clean["speaker"]
overlay_df["date"] = liar_clean["date"]

# Step 3: Plot with Plotly
fig = px.scatter(
    overlay_df,
    x='x',
    y='y',
    color='status',
    hover_data=["statement", "perc_true", "speaker", "date"],
    opacity=0.7,
    title="Anomaly Detection Overlay: Isolation Forest vs One-Class SVM",
    color_discrete_map={
        "both anomaly": "red",
        "isolation forest only": "orange",
        "one class SVM only": "purple",
        "normal": "lightgray"
    }
)
fig.show()
🗳️ Consensus Anomaly Detection
Since each method captures different types of outliers, we apply a voting strategy:
- Each method outputs a binary prediction (anomaly or not).
- We tally how many methods voted for each point being an anomaly:
  - 1 method → Weak anomaly
  - 2 methods → Moderate anomaly
  - 3 methods → Strong consensus
This gives a graded view of anomalousness and reduces reliance on any single method.
✅ This consensus approach mirrors our earlier strategy with clustering — acknowledging ambiguity and uncertainty by aggregating multiple signals.
df_anomaly_votes = pd.DataFrame(index=liar_clean.index)
df_anomaly_votes["iforest"] = y_pred_isolation   # From earlier
df_anomaly_votes["svm"] = y_pred_svm
df_anomaly_votes["lof"] = y_pred_lof

# Convert to binary (1 = anomaly)
binary = lambda x: 1 if x == -1 else 0
df_anomaly_votes = df_anomaly_votes.applymap(binary)

# Count how many methods flagged as anomaly
df_anomaly_votes["votes"] = df_anomaly_votes.sum(axis=1)

# Inspect most agreed-upon anomalies
df_strong_outliers = df_anomaly_votes[df_anomaly_votes["votes"] >= 2]
# Step 1: Combine anomaly labels from all methods
df_votes = pd.DataFrame(index=liar_clean.index)
df_votes["IsolationForest"] = y_pred_isolation
df_votes["OneClassSVM"] = y_pred_svm
df_votes["LOF"] = y_pred_lof

# Step 2: Convert to binary (1 = anomaly, 0 = normal)
df_votes = df_votes.applymap(lambda x: 1 if x == -1 else 0)

# Step 3: Count how many methods flagged each point as an anomaly
df_votes["consensus"] = df_votes.sum(axis=1)

# Step 4: Create plot DataFrame with UMAP coordinates
df_consensus_plot = pd.DataFrame({
    "x": X_umap[:, 0],
    "y": X_umap[:, 1],
    "consensus": df_votes["consensus"],
    "statement": liar_clean["statement"],
    "perc_true": liar_clean["perc_true"],
    "speaker": liar_clean["speaker"],
    "date": liar_clean["date"]
})

# Step 5: Map consensus values to categories for color
def label_consensus(v):
    if v == 0:
        return "Not an anomaly"
    elif v == 1:
        return "Weak (1 method)"
    elif v == 2:
        return "Moderate (2 methods)"
    else:
        return "Strong (3 methods)"

df_consensus_plot["consensus_label"] = df_consensus_plot["consensus"].map(label_consensus)

# Step 6: Plot with Plotly
import plotly.express as px
fig = px.scatter(
    df_consensus_plot,
    x="x",
    y="y",
    color="consensus_label",
    hover_data=["statement", "perc_true", "speaker", "date"],
    title="Consensus Anomaly Detection (UMAP projection)",
    opacity=0.75,
    color_discrete_sequence=px.colors.qualitative.Safe
)
fig.show()
The chosen algorithms identify anomalies based on different principles: Isolation Forest on ease of isolation, LOF on local density comparison, and OneClassSVM on boundary definition. The consensus approach helps provide confidence in anomaly identification by showing where methods agree. Here are the conclusions we can draw from all the analyses run in this anomaly detection section:
- Isolation Forest:
- Shows a binary classification of normal (blue) vs anomaly (orange)
- Identifies anomalies scattered throughout the dataset
- Particularly focuses on points in the right clusters and some in the left/central region
- Appears to be more liberal in anomaly identification compared to LOF
- Local Outlier Factor (LOF):
- More conservative in anomaly detection than Isolation Forest
- Primarily identifies anomalies in the upper right clusters
- Very few anomalies detected in the central/main cluster
- Focuses on local density deviations
- OneClassSVM Anomaly Detection:
- Distribution of anomalies similar to Isolation Forest
- Detects anomalies scattered throughout the dataset
- Identifies more anomalies in the central region than LOF
- Points at the boundaries of clusters are often flagged
- Overlay Comparison:
- Shows agreement/disagreement between Isolation Forest and OneClassSVM
- Gray points are “normal” according to both methods
- Purple points are anomalies detected only by OneClassSVM
- Orange points are anomalies detected only by Isolation Forest
- Red points are anomalies detected by both methods
- Significant overlap between the two methods, but also unique detections
- Consensus Anomaly Detection:
- This shows a gradation of anomaly strength based on agreement between methods
- Light blue points represent “Not an anomaly”
- Pink points are “Weak” anomalies (detected by 1 method)
- Orange points are “Moderate” anomalies (detected by 2 methods)
- Green points are “Strong” anomalies (detected by all 3 methods)
- The strongest consensus anomalies appear predominantly in the smaller clusters on the right side
Key observations across methods:
- The smaller clusters on the right side (x = 10–15) consistently show higher anomaly rates
- The central elongated cluster (x = -5 to 5) has sporadic anomalies
- Isolation Forest and OneClassSVM appear more similar to each other than to LOF
- LOF focuses more on local density, while the other methods may be more sensitive to global structure
- The consensus approach provides a more nuanced view of anomaly strength
Topic modeling: Latent Dirichlet Allocation (LDA)
After clustering and anomaly detection, we move to topic modeling, starting with LDA.
For LDA, we need to use the data before the LSA transformation and prior to the formation of the DFM (i.e. the pre-processed text prior to the TF-IDF transformation). LDA works best with raw counts rather than TF-IDF weights because it is a probabilistic model that assumes a generative process based on word frequencies, so we go back to the pre-processed text and use CountVectorizer this time to create a DFM.
count_vectorizer = CountVectorizer(
    max_df=0.5,
    min_df=5,
    ngram_range=(1, 3),   # can also try (2, 3) for stricter phrase focus
)
X_counts = count_vectorizer.fit_transform(docs)
We can use topic stability and topic diversity/exclusivity as before to determine the number of topics for LDA.
analyze_components(X_counts, final_texts, count_vectorizer, range(2, 25))
1. Evaluating Topic Stability...
2. Evaluating Topic Quality...

=== Analysis Complete ===
Based on the plots, manually set the optimal number of components.
Selected number of components: 5
The curves show a good trade-off between stability and diversity/exclusivity at a number of topics of 5. So we’ll set the number of topics to 5.
lda = LatentDirichletAllocation(
    n_components=5,        # Number of topics
    random_state=42,
    learning_method='online',
    max_iter=25
)

lda_output = lda.fit_transform(X_counts)

# Display topics
feature_names = count_vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print(f"Topic #{topic_idx+1}:")
    print("|".join([feature_names[i] for i in topic.argsort()[:-11:-1]]))
    print()
Topic #1:
have|say|job|bill|the_united_states|even|now|get|money|year

Topic #2:
say|people|obama|support|want|make|vote|family|right|trump

Topic #3:
say|country|have|take|cost|get|america|come|american|congress

Topic #4:
government|work|say|spend|woman|republican|the|wisconsin|state|florida

Topic #5:
more|year|state|percent|have|say|time|pay|just|texas
Let’s print out the top statements per topic.
doc_topic_dist = lda.transform(X_counts)   # shape: (n_docs, n_topics)
liar['dominant_topic'] = doc_topic_dist.argmax(axis=1)
for topic_num in range(lda.n_components):
    print(f"\n=== TOPIC {topic_num} ===")
    top_indices = doc_topic_dist[:, topic_num].argsort()[::-1][:5]
    for i in top_indices:
        print(f"- {liar.iloc[i]['statement']}")
=== TOPIC 0 ===
- Russia and China "absolutely said pretty clearly" they would not keep economic sanctions on Iran if the United States "walked away from the deal.
- Joe Biden's sanctions on Russia "are riddled with loopholes and don't even start for 30 days. They have carve outs for the energy and financial sectors.
- Soybeans dropped 70 cents and corn 50 cents per bushel in two days, causing farmers to lose big money, "thanks to Biden's executive orders.
- The United States is "about to have the smallest Army since before WWII, the smallest Navy since WWI and the smallest Air Force ever.
- We've got 40 years of study now that show that, with a good preschool start, you're less likely to be on public dole, you're less likely to be in prison.
=== TOPIC 1 ===
- Says "Morgan Carroll opposed requiring convicted sex offenders to register their online profiles, making it harder to track online sex offenders and child predators.
- Says Bernie Sanders once said, "Instead of removing the conditions that make people depressed, modern society gives them antidepressant drugs.
- Says after leaving the White House, Hillary Clinton "was forced to return an estimated $200,000 in White House china, furniture and artwork that she had stolen.
- Photo shows Dianne Feinstein "bullying" Lisa Murkowski over "the now empty seat of the late Justice Ruth Bader Ginsburg.
- In the primary election, a "right-wing group spent more than $100,000,000 to support" Supreme Court Justice Rebecca Bradley.
=== TOPIC 2 ===
- Says Virginia Gov. Ralph Northam said, 'You will give up your guns, if you don't I'll have the National Guard cut your power, your phone lines, and your internet. Then, if you still refuse to comply I'll have you killed.'
- Ketanji Brown Jackson "says she gave pedophiles lighter sentences (because) it's different when they use computers vs mail to get volumes of child porn. This makes 'total sense' according to Jackson.
- The Trans-Pacific Partnership "knocks out 18,000 tariffs that other countries place on American products and goods.
- Says that Sen. Sherrod Brown is "out there egging on a lot of these protesters who are spitting on policemen and going to the bathroom on policemen's cars at these protests on Wall Street and other places.
- Protest vandalism in Richmond "all started two weeks ago when Mayor Stoney's police gassed - tear gassed - a peaceful crowd of protesters, moms and children, at 7:30 p.m. sharp.
=== TOPIC 3 ===
- Human error, such as skipping a community in the Wisconsin Supreme Court race, "is common in (the vote-tabulation) process.
- Says Pope Francis "was arrested Saturday in connection with an 80-count indictment of charges including possession of child pornography, human trafficking, incest, possession of drug paraphernalia and felony fraud.
- Between 2008 and 2014, "criminal aliens accounted for 38 percent of all murder convictions in the five states of California, Texas, Arizona, Florida and New York.
- If Russia, or some other entity, was hacking, why did the White House wait so long to act? Why did they only complain after Hillary lost?
- Says a California surfer was "alone, in the ocean," when he was arrested for violating the state's stay-at-home order.
=== TOPIC 4 ===
- A small business receiving a federal tax credit "to provide health care for its employees ... would be barred under the Stupak Amendment from allowing their employees to use ... the health care that they offer them for abortion.
- The top 1 percent of income earners pay 40 percent of all state income taxes, and those at the bottom pay little or nothing.
- Mark Kelly "voted to allow prison inmates to receive stimulus checks five separate times.
- Foxconn will not face oversight from any federal, state, or local agency to guarantee it complies with our wetlands protection laws.
- The House voted to cap insulin prices to $35/month, just $420 each year! Richard Hudson voted NO.
Looking at these LDA results, here’s what we can say:
Topic Analysis:
Topic 0: This topic appears focused on international relations, economic impacts, and national security. Key terms include “Russia,” “China,” “sanctions,” “Iran,” “United States,” and references to military size. This could represent discussions about foreign policy and defense.
Topic 1: This topic centers on political allegations, accusations against politicians, and controversial claims. Notable mentions include references to sex offenders, stolen items from the White House, and campaign spending. This seems to capture political attack messaging.
Topic 2: This topic contains content about protests, controversial political statements attributed to officials, and trade policy. It includes references to the Trans-Pacific Partnership, protesters, and alleged statements from political figures.
Topic 3: This topic appears to focus on claims about crime, election integrity, and controversial allegations. It includes references to “criminal aliens,” hacking claims, and alleged arrests. This seems to capture sensational or contested claims.
Topic 4: This topic appears to cover healthcare policy, taxation, and economic issues. It mentions healthcare, tax credits, stimulus checks, and specific policy proposals like insulin price caps.
Observations:
The topics show some coherence but also overlap considerably. For example, political figures (Obama, Trump) appear across multiple topics.
The model appears to have grouped statements more by subject matter than by other potential organizing principles like veracity or political leaning.
The top documents for each topic show how the LDA model is clustering similar content, though there’s still significant thematic overlap.
The road ahead:
- We could consider adjusting the number of topics. The current 5 topics might be too few to capture distinct themes; we could try running with 7–10 topics to see if we get clearer separation (a sketch follows below).
- We could try BERTopic next. Since our data has nuanced political content, BERTopic might capture more subtle semantic differences that word-frequency-based LDA might miss.
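A minimal sketch of the first idea, refitting LDA for a few candidate topic counts and comparing in-sample perplexity (lower is better); names reuse this notebook's objects:

# Sketch: probe a few topic counts before committing to one
for k in [5, 7, 10]:
    lda_k = LatentDirichletAllocation(n_components=k, random_state=42,
                                      learning_method='online', max_iter=25)
    lda_k.fit(X_counts)
    print(k, lda_k.perplexity(X_counts))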
🗺️ LDA results visualisation
Before moving to BERTopic, let’s first visualise the results of LDA.
# Prepare visualization
panel = pyLDAvis.prepare(
    topic_term_dists=lda.components_ / lda.components_.sum(axis=1)[:, None],   # normalize
    doc_topic_dists=lda.transform(X_counts),
    doc_lengths=(X_counts > 0).sum(axis=1).A1,   # number of tokens in each doc
    vocab=count_vectorizer.get_feature_names_out(),
    term_frequency=X_counts.sum(axis=0).A1
)

# To display inline in Jupyter Notebook:
pyLDAvis.display(panel)
Topic modeling: Moving beyond LDA to BERTopic
Our first attempt at uncovering thematic structure in political statements used LDA (Latent Dirichlet Allocation) — a classic, word-frequency-based topic modeling approach.
While it provided some insight, it fell short in a few key areas:
🧪 What we observed:
- Topics showed some coherence, but also a lot of semantic overlap — for example, figures like Obama or Trump appeared across multiple topics.
- It grouped statements by surface-level subject matter, but missed deeper patterns (like framing, stance, or veracity).
- Crucially, it required us to choose the number of topics manually, relying on heuristics or trial-and-error.
We even noted that increasing the number of topics (e.g. from 5 to 10) might improve clarity — but that still wouldn’t solve the core issue: LDA doesn’t “understand” meaning.
🚀 Enter BERTopic: Meaning-Aware Topic Modeling
To get richer, more distinct topics, we turned to BERTopic — a modern topic modeling technique that understands context and meaning using neural embeddings and clustering.
Rather than just counting co-occurring words, BERTopic groups statements that are semantically similar, even when they use different language.
🧠 What BERTopic Actually Does
BERTopic combines several powerful components:
1. Transformer-based Embeddings (for meaning)
It starts by turning each statement into a vector embedding — a list of numbers that represents its semantic meaning. This is done using a Transformer-based language model like BERT or MiniLM.
🧠 A Transformer is a type of deep neural network that processes text by attending to the relationships between all words in a sentence, allowing it to understand context far better than older models.
So phrases like:
- “Healthcare premiums are rising”
- “Obamacare caused insurance costs to go up”

…are embedded close together, even if they don’t share words.
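A minimal sketch of this idea, using the same MiniLM model we load for BERTopic below (the cosine values are indicative expectations, not exact figures):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode([
    "Healthcare premiums are rising",
    "Obamacare caused insurance costs to go up",
    "Barack Obama announced the policy during his presidency",
])
print(util.cos_sim(emb[0], emb[1]))   # expected high: same meaning, different words
print(util.cos_sim(emb[0], emb[2]))   # expected lower: related entity, different meaning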
2. HDBSCAN Clustering (for automatic topic discovery)
Next, it uses HDBSCAN — a clustering algorithm related to DBSCAN — to group similar statements into topics:
- Finds dense clusters of similar texts in embedding space
- Automatically determines the number of topics
- Labels ambiguous or isolated statements as outliers (e.g., topic -1)
This avoids the need for LDA-style guesswork about how many topics there should be.
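Roughly what happens under the hood (BERTopic also compresses the embeddings with UMAP first, omitted here for brevity); a sketch assuming emb is an (n_docs, dim) embedding matrix and the hdbscan package is installed:

import hdbscan

clusterer = hdbscan.HDBSCAN(min_cluster_size=10, metric='euclidean')
cluster_labels = clusterer.fit_predict(emb)   # -1 marks outliers
print(len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0), "topics discovered")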
3. c-TF-IDF Scoring (for distinctiveness)
Finally, BERTopic computes class-based TF-IDF (c-TF-IDF) to identify the most distinctive words for each topic:
- It merges all texts in a topic into one virtual document
- Then scores words that are frequent within the topic but rare across others
This yields interpretable keywords that make each topic easier to understand and label.
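A rough numpy sketch of the c-TF-IDF idea (illustrative only, not BERTopic's exact implementation):

import numpy as np

# Rows: per-topic word counts after merging each topic's texts into one "virtual document"
counts = np.array([[10, 0, 2],    # virtual document for topic A
                   [1, 8, 2]])    # virtual document for topic B
tf = counts / counts.sum(axis=1, keepdims=True)   # frequency within each topic
A = counts.sum() / counts.shape[0]                # average words per topic
idf = np.log(1 + A / counts.sum(axis=0))          # rare across topics => higher weight
print((tf * idf).round(3))   # distinctive words score highest in each row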
🔍 LDA vs. BERTopic: A Side-by-Side Comparison
Let’s illustrate how these methods differ using a small example.
Input Statements:
1. Obamacare led to rising healthcare premiums.
2. Health insurance rates have gone up in recent years.
3. Barack Obama announced the policy during his presidency.
LDA
- Based on word co-occurrence.
- Might place (1) and (3) in the same topic due to the word “Obama”.
- May not link (1) and (2), since they don’t share many exact words.

Result: Topics influenced by shared surface terms, even if meanings diverge.
BERTopic
- Uses embeddings to understand that (1) and (2) both talk about healthcare costs.
- Understands that (3) is more biographical or political.
- Clusters (1) and (2) together, while (3) might land in a different cluster or become an outlier.

Result: Topics grouped by meaning, not just vocabulary.
Model | Grouping Logic | Grouping Result |
---|---|---|
LDA | Groups by word co-occurrence | Might group (1) and (3) together (both say “Obama”) |
BERTopic | Groups by semantic similarity via embeddings | Groups (1) and (2) together (both about healthcare costs) |
🔄 Summary Comparison
Feature | LDA | BERTopic |
---|---|---|
Text representation | Bag-of-Words | Sentence embeddings (semantic) |
Context understanding | ❌ None | ✅ Yes |
Synonym/generalization | ❌ No | ✅ Yes |
Topic count | Must be set manually | Discovered automatically with HDBSCAN |
Topic distinctiveness | Frequency-based | c-TF-IDF (topic-specific relevance) |
Outlier handling | ❌ No | ✅ Yes (noise is excluded) |
Visualization | Requires custom setup | ✅ Built-in and interactive |
💡 Why BERTopic Makes Sense for This Dataset
Political statements are:
- Short
- Often nuanced or rhetorically loaded
- Full of indirect phrasing and synonyms
This makes surface-level word clustering (like in LDA) brittle. BERTopic’s semantic approach is far better suited — helping us uncover meaningful, well-separated themes grounded in how people actually use language.
# Step 1: Extract raw statements
texts = liar_clean["statement"].tolist()

# Optional: Specify embedding model (MiniLM is a good default)
embedding_model = SentenceTransformer("all-MiniLM-L6-v2", cache_folder=os.getcwd())

# Step 2: Create BERTopic model with embedding model
topic_model = BERTopic(embedding_model=embedding_model, verbose=True)

# Step 3: Fit model on raw statements
topics, probs = topic_model.fit_transform(texts)

# Step 4: Explore topics
all_topics = topic_model.get_topic_info()
n_topics = all_topics.shape[0] - 1   # Subtract 1 to exclude the "-1" outlier topic
print(f"Number of topics found (excluding outliers): {n_topics}")
# OR:
print(f"Words for topic 0", topic_model.get_topic(0))   # Get words for topic 0
# Step 5: Visualize topics (interactively)
print("Visualising the top 10 topics")
topic_model.visualize_topics()

topic_model.visualize_barchart(top_n_topics=10)

topic_model.visualize_heatmap()
topic_info = topic_model.get_topic_info()
topic_info
  | Topic | Count | Name | Representation | Representative_Docs |
---|---|---|---|---|---|
0 | -1 | 1894 | -1_the_to_says_in | [the, to, says, in, of, and, for, is, on, that] | [There was no effort” to get American experts … |
1 | 0 | 485 | 0_health_care_obamacare_medicare | [health, care, obamacare, medicare, insurance,… | [Wisconsin has “the number one health care sys… |
2 | 1 | 348 | 1_covid_19_vaccine_coronavirus | [covid, 19, vaccine, coronavirus, vaccines, va… | [Video shows that certain drinks can “test pos… |
3 | 2 | 305 | 2_election_vote_voter_ballots | [election, vote, voter, ballots, votes, voting… | [In Iowa, “since we have put a number of the v… |
4 | 3 | 208 | 3_school_schools_education_public | [school, schools, education, public, graduatio… | [Mayor Fung wants to punish our children’s edu… |
… | … | … | … | … | … |
68 | 67 | 11 | 67_formula_baby_shortage_obesity | [formula, baby, shortage, obesity, current, ki… | [Due to the baby formula shortage Tricare will… |
69 | 68 | 11 | 68_fund_rainy_budget_budgets | [fund, rainy, budget, budgets, governor, day, … | [As state Commerce secretary, Mary Burke draft… |
70 | 69 | 11 | 69_gender_old_consent_boy | [gender, old, consent, boy, school, trans, gir… | [Assertions that it makes no difference whethe… |
71 | 70 | 11 | 70_virginia_west_virginians_moving | [virginia, west, virginians, moving, average, … | [West Virginia added the highest percentage of… |
72 | 71 | 10 | 71_jefferson_thomas_written_government | [jefferson, thomas, written, government, restr… | [Thomas Jefferson said, “That government is be… |
73 rows × 5 columns
BERTopic finds 72 topics (aside from outliers)! The distribution shows a long-tail pattern typical of news/political discourse, where a few major issues dominate but many niche topics exist. The predominant topics are healthcare/Obamacare/Medicare (topic 0), COVID-19/vaccines (topic 1), elections/voting, clearly tied to voting or election-fraud narratives (topic 2), and education/schools (topic 3), while some niche topics cover the NFL anthem protests or North Carolina's HB2. BERTopic also finds many outliers (there are more outliers than there are statements in the top topic!), either illustrating the diversity of topics and the inherent noisiness of the data, or suggesting that we might need to consider different embeddings or optimise the minimum topic size parameter for BERTopic to better capture the data structure.
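Two hedged ways to act on this, using BERTopic's API (min_topic_size at fit time; reduce_outliers, available in recent BERTopic versions, after fitting):

# Option 1: enforce larger topics at fit time (default min_topic_size is 10)
topic_model_coarse = BERTopic(embedding_model=embedding_model, min_topic_size=30, verbose=True)

# Option 2: reassign outlier documents to their closest topic after fitting
new_topics = topic_model.reduce_outliers(texts, topics)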
Just as with LDA, we want to see which words are more representative of each topic and which are the top statements associated with each topic.
def get_representative_docs(topic_model, texts, topics, probs, n=5):
    """
    Get the most representative documents for each topic.

    Parameters:
    -----------
    topic_model : BERTopic model
        The fitted BERTopic model
    texts : list
        List of original text documents
    topics : list
        List of topic assignments for each document
    probs : list
        List of probability distributions for each document
    n : int, optional
        Number of representative documents to return per topic

    Returns:
    --------
    dict
        Dictionary mapping topic IDs to lists of representative documents
    """
    topic_to_docs = {}

    # Exclude outlier topic (-1) and get unique topics
    unique_topics = set(topics)
    if -1 in unique_topics:
        unique_topics.remove(-1)

    for topic_id in unique_topics:
        # Get document indices for this topic
        indices = [i for i, t in enumerate(topics) if t == topic_id]

        # Skip if no documents in this topic
        if not indices:
            continue

        # Get probabilities for these documents
        # Each document's probability of belonging to this topic
        topic_probs = []
        for i in indices:
            # Find the probability for this topic
            # The structure depends on how BERTopic returns probabilities
            if isinstance(probs[i], list) and len(probs[i]) > 0:
                if isinstance(probs[i][0], tuple):
                    # If probs is a list of tuples like [(topic_id, prob), ...]
                    for t, p in probs[i]:
                        if t == topic_id:
                            topic_probs.append(p)
                            break
                    else:
                        topic_probs.append(0.0)   # Topic not found in probabilities
                else:
                    # If probs is a list of probabilities in order of topics
                    try:
                        topic_idx = list(topic_model.topic_mapper_.index_to_topic_.values()).index(topic_id)
                        topic_probs.append(probs[i][topic_idx])
                    except:
                        topic_probs.append(0.0)   # Topic not found in mapping
            else:
                topic_probs.append(0.0)   # No probability data

        # If we couldn't get reliable probabilities, use all documents
        if all(p == 0.0 for p in topic_probs):
            sorted_indices = indices[:n]
        else:
            # Sort by probability (highest first) and get top n
            sorted_indices = [indices[i] for i in np.argsort(topic_probs)[::-1][:n]]

        # Store documents
        topic_to_docs[topic_id] = [texts[i] for i in sorted_indices]

    return topic_to_docs
def display_topics_with_docs(topic_model, texts, topics, probs, n_docs=3, n_words=10):
    """
    Display topics with their representative documents.

    Parameters:
    -----------
    topic_model : BERTopic model
        The fitted BERTopic model
    texts : list
        List of original text documents
    topics : list
        List of topic assignments for each document
    probs : list
        List of probability distributions for each document
    n_docs : int, optional
        Number of representative documents to show per topic
    n_words : int, optional
        Number of words to show for each topic
    """
    # Get topic information
    topic_info = topic_model.get_topic_info()

    # Get representative documents
    rep_docs = get_representative_docs(topic_model, texts, topics, probs, n=n_docs)

    # Create a clean display
    print(f"{'='*80}")
    print(f"{'TOPIC SUMMARY':^80}")
    print(f"{'='*80}")

    # Display info for each topic, sorted by size (excluding -1)
    for _, row in topic_info[topic_info['Topic'] != -1].iterrows():
        topic_id = row['Topic']
        topic_name = row['Name']
        topic_size = row['Count']

        # Get words for this topic
        topic_words = topic_model.get_topic(topic_id)
        words_str = ", ".join([f"{word} ({score:.3f})" for word, score in topic_words[:n_words]])

        print(f"\nTOPIC {topic_id}: {topic_name} ({topic_size} documents)")
        print(f"Keywords: {words_str}")

        # Display representative documents
        if topic_id in rep_docs:
            print("\nRepresentative documents:")
            for i, doc in enumerate(rep_docs[topic_id]):
                # Limit document length for display
                display_doc = doc[:200] + "..." if len(doc) > 200 else doc
                print(f"{i+1}. {display_doc}")
        else:
            print("\nNo representative documents found for this topic.")

        print(f"{'-'*80}")

    # Display info for outlier topic if it exists
    if -1 in topics:
        outlier_count = topics.count(-1)
        print(f"\nOUTLIER TOPIC: -1 ({outlier_count} documents)")
        print("Documents that don't fit well into any topic")
        print(f"{'-'*80}")

# Example usage:
# display_topics_with_docs(topic_model, texts, topics, probs)
# To save the results to a file:
def save_topics_with_docs(topic_model, texts, topics, probs, n_docs=5, n_words=10, filename="topic_results.md"):
    """Save topics with their representative documents to a markdown file."""
    with open(filename, "w") as f:
        f.write("# BERTopic Results\n\n")

        # Get topic information
        topic_info = topic_model.get_topic_info()

        # Get representative documents
        rep_docs = get_representative_docs(topic_model, texts, topics, probs, n=n_docs)

        # Display info for each topic, sorted by size (excluding -1)
        for _, row in topic_info[topic_info['Topic'] != -1].iterrows():
            topic_id = row['Topic']
            topic_name = row['Name']
            topic_size = row['Count']

            # Get words for this topic
            topic_words = topic_model.get_topic(topic_id)
            words_str = ", ".join([f"{word} ({score:.3f})" for word, score in topic_words[:n_words]])

            f.write(f"\n## TOPIC {topic_id}: {topic_name} ({topic_size} documents)\n\n")
            f.write(f"**Keywords:** {words_str}\n\n")

            # Display representative documents
            if topic_id in rep_docs:
                f.write("### Representative documents:\n\n")
                for i, doc in enumerate(rep_docs[topic_id]):
                    f.write(f"{i+1}. {doc}\n\n")
            else:
                f.write("No representative documents found for this topic.\n\n")

            f.write("---\n")
display_topics_with_docs(topic_model, texts, topics, probs, n_docs=3, n_words=8)

# Save results to a markdown file for easier sharing/analysis
save_topics_with_docs(topic_model, texts, topics, probs, n_docs=5, n_words=10,
                      filename="liar_topics_analysis.md")
================================================================================
TOPIC SUMMARY ================================================================================
0: 0_health_care_obamacare_medicare (485 documents)
TOPIC 0.038), care (0.036), obamacare (0.027), medicare (0.026), insurance (0.021), social (0.015), medicaid (0.014), security (0.014)
Keywords: health (
Representative documents:1. A small business receiving a federal tax credit "to provide health care for its employees ... would be barred under the Stupak Amendment from allowing their employees to use ... the health care that t...
2. Says Mark Udall "decided Obamacare. ... He passed Obamacare with his vote.
3. Says Bernie Sanders once said, "Instead of removing the conditions that make people depressed, modern society gives them antidepressant drugs.
--------------------------------------------------------------------------------
1: 1_covid_19_vaccine_coronavirus (348 documents)
TOPIC 0.060), 19 (0.048), vaccine (0.033), coronavirus (0.026), vaccines (0.019), vaccinated (0.016), people (0.014), are (0.014)
Keywords: covid (
Representative documents:1. Video shows that certain drinks can "test positive" for COVID-19 when using at-home COVID-19 tests.
2. The AstraZeneca COVID-19 vaccine contains a Bluetooth microchip.
3. Says drinking a "6-ounce glass of tonic water each night before bed" will help alleviate restless legs syndrome.
--------------------------------------------------------------------------------
2: 2_election_vote_voter_ballots (305 documents)
TOPIC 0.029), vote (0.026), voter (0.022), ballots (0.021), votes (0.019), voting (0.018), voters (0.017), in (0.016)
Keywords: election (
Representative documents:1. Canadian-born Ted Cruz "has had a double passport.
2. In certain swing states, there were more votes than people who voted, and in big numbers.
3. Electronic voting machines didn't allow people to vote" in Maricopa County, Arizona.
--------------------------------------------------------------------------------
3: 3_school_schools_education_public (208 documents)
TOPIC 0.043), schools (0.035), education (0.031), public (0.021), graduation (0.020), college (0.020), students (0.019), funding (0.018)
Keywords: school (
Representative documents:1. Richmond teachers have not gotten a raise "for almost 10 years.
2. Inner-city students "fall back and don't succeed, whereas if there was a less intensive track, they would.
3. Says "Gary Johnson and the Libertarian Party want to get rid of the public school system.
--------------------------------------------------------------------------------
4: 4_obama_barack_iran_president (173 documents)
TOPIC 0.046), barack (0.033), iran (0.023), president (0.023), nuclear (0.019), muslim (0.016), to (0.015), said (0.014)
Keywords: obama (
Representative documents:1. Barack Obama: I'm 'the First Sitting American President to Come From Kenya.'
2. Russia and China "absolutely said pretty clearly" they would not keep economic sanctions on Iran if the United States "walked away from the deal.
3. Says Barack Obama is "trying to slash funding for the Armed Pilots Program designed to prevent terror attacks.
--------------------------------------------------------------------------------
5: 5_debt_trillion_deficit_spending (126 documents)
TOPIC 0.078), trillion (0.042), deficit (0.032), spending (0.028), national (0.027), budget (0.022), federal (0.019), the (0.018)
Keywords: debt (
Representative documents:1. Sheldon Whitehouse's failure of leadership, the adding of almost $8 trillion in debt in his first full term, is shameful...
2. We have a national debt the size of our nation's economy for the first time since World War II.
3. The largest contributor to the debt ceiling, or to our deficit, has been the Trump tax cuts.
--------------------------------------------------------------------------------
6: 6_abortion_parenthood_planned_abortions (110 documents)
TOPIC 0.074), parenthood (0.044), planned (0.041), abortions (0.038), women (0.024), birth (0.023), baby (0.022), pro (0.021)
Keywords: abortion (
Representative documents:1. Planned Parenthood "is an organization that funnels millions of dollars in political contributions to pro-abortion candidates.
2. The House health care bill provides for "free abortion services, and probably forced participation in abortions by members of the medical profession.
3. Says he didn't call Dr. George Tiller a baby killer, as liberal groups charge, but was merely reporting what "some prolifers branded him.
--------------------------------------------------------------------------------
7: 7_crime_gun_violent_killed (109 documents)
TOPIC 0.046), gun (0.029), violent (0.026), killed (0.026), rate (0.023), in (0.022), violence (0.018), murder (0.018)
Keywords: crime (
Representative documents:1. Most of your serial killers, most of your people who commit domestic violence, they start off by abusing animals.
2. There are places in America that are among the most dangerous in the world. You go to places like Oakland. Or Ferguson. The crime numbers are worse. Seriously.
3. Says the manager of a Checkers in Miami Gardens, Fla., killed 10 employees.
--------------------------------------------------------------------------------
8: 8_jobs_created_job_000 (108 documents)
TOPIC 0.082), created (0.045), job (0.031), 000 (0.029), ohio (0.028), sector (0.024), since (0.021), more (0.018)
Keywords: jobs (
Representative documents:1. We've had the same leadership in the Travis County DA's office for 30 years.
2. There was a "loss of eight million jobs during the Bush eight years.
3. There are giant human caucasian skeletons in Ohio that predate any known civilizations.
--------------------------------------------------------------------------------
9: 9_biden_joe_video_shows (100 documents)
TOPIC 0.140), joe (0.113), video (0.046), shows (0.035), hunter (0.026), president (0.023), his (0.023), he (0.017)
Keywords: biden (
Representative documents:1. Joe Biden's wide open southern border" is to blame for fentanyl deaths in the U.S.
2. Says Jill Biden said, "Whatever is on Hunter's laptop is not important.
3. There is no Biden presidency. The real Biden was executed for his crimes long ago along with Clinton.
--------------------------------------------------------------------------------
10: 10_gun_background_checks_guns (79 documents)
TOPIC 0.068), background (0.053), checks (0.037), guns (0.033), nra (0.026), you (0.026), ban (0.024), firearms (0.024)
Keywords: gun (
Representative documents:1. California's Proposition 63 would "CRIMINALIZE the sharing of ammunition between friends who may be hunting or shooting together.
2. The New START treaty was passed despite significant concerns among some people in the State and Defense Departments.
3. A law that mandates fingerprinting for gun purchasers is "a requirement that's reduced gun crimes in the five states where it's the law.
--------------------------------------------------------------------------------
11: 11_border_immigration_mexico_illegal (70 documents)
TOPIC 0.092), immigration (0.029), mexico (0.027), illegal (0.024), patrol (0.022), secure (0.020), we (0.019), illegally (0.018)
Keywords: border (
Representative documents:1. Says President Donald Trump said asylum seekers "aren't people. These are animals.
2. The Obama administration "manipulated deportation data to make it appear that the Border Patrol was deporting more illegal immigrants than the Bush administration.
3. A lot of these migrants that are coming, we have no way to screen their backgrounds for either health or for security.
--------------------------------------------------------------------------------
12: 12_climate_change_warming_ice (60 documents)
TOPIC 0.067), change (0.049), warming (0.046), ice (0.041), global (0.034), earth (0.032), hoax (0.023), emissions (0.022)
Keywords: climate (
Representative documents:1. Asteroid warning: NASA tracks a 4KM killer rock on approach - Could end human civilisation.
2. Carbon dioxide is not "a primary contributor to the global warming that we see.
3. We need to also recognize that we just had two of the coldest years, biggest drop in global temperatures that we have had since the 1980s, the biggest in the last 100 years. We don't talk about that.
--------------------------------------------------------------------------------
13: 13_ukraine_russian_video_moon (58 documents)
TOPIC 0.096), russian (0.054), video (0.052), moon (0.047), putin (0.045), shows (0.038), landing (0.036), ufo (0.031)
Keywords: ukraine (
Representative documents:1. Photo shows a Russian tank Ukrainians are selling on eBay.
2. NASA Curiosity finds Nazi helmet on Mars.
3. A video shows a real angel in the sky in Palestine.
--------------------------------------------------------------------------------
14: 14_taxes_tax_kaine_tim (57 documents)
TOPIC 0.086), tax (0.067), kaine (0.052), tim (0.049), increase (0.043), raise (0.042), raised (0.029), voted (0.023)
Keywords: taxes (
Representative documents:1. Says Wendy's founder "Dave Thomas left the state literally on his deathbed to avoid the estate tax.
2. Says Jeanne Shaheen "voted to pave the way for a new national energy tax.
3. Says U.S. Senate rival Tammy Baldwin supports a "tax increase that President Obama says" would cost middle-class families $3,000 per year.
--------------------------------------------------------------------------------
15: 15_tax_taxes_cuts_property (53 documents)
TOPIC 0.085), taxes (0.030), cuts (0.025), property (0.023), relief (0.023), cut (0.019), republican (0.019), irs (0.018)
Keywords: tax (
Representative documents:1. If you or your family spends $28 a day — prepare to be under constant audit by the IRS.
2. We've got a personal tax system that's so complicated it costs Americans about $500 billion a year to comply with the current tax code.
3. GDP is booming. Let's spread the prosperity to the middle class with big #TaxCuts. Our #TaxReform package doubles standard deduction, brings $$$ back home, and reduces rates for ALL taxpayers. We will...
--------------------------------------------------------------------------------
16: 16_photo_shows_epstein_jeffrey (51 documents)
TOPIC 0.107), shows (0.074), epstein (0.044), jeffrey (0.038), photos (0.035), posing (0.034), michelle (0.028), shirt (0.028)
Keywords: photo (
Representative documents:1. Photo shows Queen Elizabeth II being carried across a river by African porters.
2. Photo shows former President Donald Trump "dancing with a 13-year-old girl" on Jeffrey Epstein's private island.
3. Says Nancy Pelosi and Beto O'Rourke were photographed with El Chapo.
--------------------------------------------------------------------------------
17: 17_pelosi_nancy_kamala_harris (48 documents)
TOPIC 0.102), nancy (0.086), kamala (0.048), harris (0.047), schiff (0.039), resign (0.037), adam (0.032), video (0.032)
Keywords: pelosi (
Representative documents:1. Kamala Harris resigns in shock after Biden's 'mandatory' border trip…Trump jails 'corrupt' VP.
2. Democrats Filing TREASON Charges Against Trump, McConnell, Giuliani, & Comey.
3. Pelosi sinks to new low, tells Dems: if you have to lie to voters to win, do it.
--------------------------------------------------------------------------------
18: 18_clinton_hillary_she_emails (47 documents)
TOPIC 0.104), hillary (0.097), she (0.042), emails (0.038), fbi (0.033), her (0.032), email (0.029), laughing (0.024)
Keywords: clinton (
Representative documents:1. Says "Hillary Clinton filed for divorce In New York courts.
2. The FBI has reopened its investigation into Hillary Clinton ... and discovered another 650,000 emails.
3. Says Hillary Clinton is "wrong" to say he mocked a disabled reporter.
--------------------------------------------------------------------------------
19: 19_coal_solar_energy_electric (47 documents)
TOPIC 0.081), solar (0.064), energy (0.053), electric (0.042), power (0.039), wind (0.032), gas (0.024), electricity (0.024)
Keywords: coal (
Representative documents:1. Puerto Rico's electric grid and their electric generating plant was dead before the storms ever hit. It was in very bad shape. It was in bankruptcy. It had no money. It was largely -- you know, it was...
2. China will be allowed to build hundreds of additional coal plants. So we can't build the plants, but they can, according to this (Paris) agreement.
3. Energy nominee Steven Chu has called coal "his worst nightmare.
--------------------------------------------------------------------------------
20: 20_wisconsin_income_growth_walker (45 documents)
TOPIC 0.139), income (0.032), growth (0.031), walker (0.026), scott (0.022), player (0.021), midwest (0.019), jobs (0.019)
Keywords: wisconsin (
Representative documents:1. Wisconsin is not as obese as the national average is. The national average, 35.7 percent of the population is obese. In Wisconsin, it's closer to 26 percent.
2. The sale of more than 600,000 Wisconsin deer licenses shows that Wisconsin's hunters are the "eighth largest army in the world.
3. Wisconsin is "dead last in income growth" among midwestern states during Gov. Scott Walker's term.
--------------------------------------------------------------------------------
21: 21_wage_minimum_hour_women (44 documents)
TOPIC 0.084), minimum (0.068), hour (0.049), women (0.048), wages (0.047), cents (0.041), average (0.038), men (0.037)
Keywords: wage (
Representative documents:1. Under Barack Obama and the big government economy, the median wage for women has dropped $733.
2. Raising the national minimum wage to $15 per hour "would destroy up to 3.7 million jobs.
3. A Congressional Budget Office report says President Barack Obama's executive order to raise the minimum wage for new federal contract workers "will cost the economy 500,000 jobs.
--------------------------------------------------------------------------------
22: 22_money_campaign_million_pac (44 documents)
TOPIC 0.035), campaign (0.032), million (0.030), pac (0.030), foundation (0.025), clinton (0.024), trump (0.023), donors (0.021)
Keywords: money (
Representative documents:1. During "the recent global warming summit in Copenhagen, Nancy Pelosi and others stayed at a five-star hotel on a trip costing nearly $10,000 per person.
2. Says Sen. Elizabeth Warren "lives in a $5.4 million mansion.
3. Says his campaign hasn't accepted corporate PAC money and isn't funded by "special interests.
--------------------------------------------------------------------------------
23: 23_mccain_john_he_against (42 documents)
TOPIC 0.123), john (0.058), he (0.045), against (0.027), equal (0.026), said (0.023), negative (0.022), troops (0.022)
Keywords: mccain (
Representative documents:1. Says Claire McCaskill "voted against including health records in the background check system.
2. We haven't yet taken out a negative ad.
3. Says McCain was a "Hanoi Hilton songbird" who collaborated with the enemy.
--------------------------------------------------------------------------------
24: 24_austin_texas_city_residents (38 documents)
TOPIC 0.087), texas (0.080), city (0.051), residents (0.028), mayor (0.027), antonio (0.027), says (0.024), council (0.023)
Keywords: austin (
Representative documents:1. Says Texas Rep. Terry Meza said changing the castle doctrine is meant to create a "peaceful" transfer of property from victims to thieves
2. Austin is "effectively" imposing "a ban on barbecue restaurants.
3. Russia has more diversity in its governmental representation than we do in Texas.
--------------------------------------------------------------------------------
TOPIC 25: 25_fbi_espionage_investigation_waterboarding (37 documents)
Keywords: fbi (0.054), espionage (0.035), investigation (0.027), waterboarding (0.027), warrant (0.025), james (0.023), news (0.023), spy (0.020)
Representative documents:
1. The White House fully cooperated with the Special Counsel's investigation.
2. The FBI said (former national security adviser Michael Flynn) wasn't lying.
3. Mark Kirk's claim of national security expertise is "based on his military record which he lied about at least 10 times.
--------------------------------------------------------------------------------
TOPIC 26: 26_poverty_homeownership_wealth_rate (36 documents)
Keywords: poverty (0.109), homeownership (0.036), wealth (0.036), rate (0.035), poor (0.034), black (0.034), african (0.033), white (0.029)
Representative documents:
1. 70% of Americans in poverty are white.
2. Says California has "30% or a third of the country's people living at or below the poverty level.
3. Black Friday originated in 1904 when enslaved people were sold at a discount to boost the economy.
--------------------------------------------------------------------------------
TOPIC 27: 27_unemployment_rate_black_lowest (35 documents)
Keywords: unemployment (0.204), rate (0.070), black (0.037), lowest (0.036), unemployed (0.036), sees (0.030), percent (0.028), been (0.028)
Representative documents:
1. Says the Congressional Budget Office said "unemployment could top 9 percent in 2013
2. Hispanic unemployment has been ticking down from an all-time high of 13.9 percent because of the policies we've implemented.
3. Chicago maintained the lowest unemployment rate through most of 2021 of any large American city.
--------------------------------------------------------------------------------
TOPIC 28: 28_stimulus_checks_projects_ants (34 documents)
Keywords: stimulus (0.182), checks (0.050), projects (0.030), ants (0.029), package (0.029), went (0.029), spent (0.027), alaska (0.025)
Representative documents:
1. Says Steve Jobs was responsible for creating more jobs than the stimulus bill.
2. Catherine Cortez Masto "was the deciding vote" to send COVID-19 stimulus checks to more than 1 million prisoners.
3. $2,000 stimulus checks were due on Jan. 19.
--------------------------------------------------------------------------------
TOPIC 29: 29_antifa_matter_lives_school (32 documents)
Keywords: antifa (0.144), matter (0.042), lives (0.041), school (0.040), parents (0.036), shooter (0.036), terrorist (0.035), black (0.032)
Representative documents:
1. Black Lives Matter and antifa set Catholic church on fire in Minneapolis.
2. The Nashville, Tennessee, school shooter was Samantha Hyde.
3. 12 white female bodies in garage freezer tagged 'Black Lives Matter.'
--------------------------------------------------------------------------------
TOPIC 30: 30_oil_gallon_gas_pump (32 documents)
Keywords: oil (0.125), gallon (0.048), gas (0.047), pump (0.036), cents (0.036), prices (0.034), energy (0.033), russia (0.032)
Representative documents:
1. Soybeans dropped 70 cents and corn 50 cents per bushel in two days, causing farmers to lose big money, "thanks to Biden's executive orders.
2. We have doubled our (oil) imports from Russia in the last year.
3. Ron Klein "sponsored an amendment that specifically allows price gouging.
--------------------------------------------------------------------------------
TOPIC 31: 31_food_milk_fast_fda (31 documents)
Keywords: food (0.079), milk (0.044), fast (0.041), fda (0.034), store (0.028), shortage (0.027), salmonella (0.023), eggs (0.023)
Representative documents:
1. Not one illness has been reported from raw milk in" Texas "in more than four years. A total of six illnesses have occurred in the last 20 years.
2. McDonald's uses potatoes sprayed with a highly toxic pesticide called Monitor.
3. The current (agriculture) commissioner allowed tainted beef to be sent to school cafeterias.
--------------------------------------------------------------------------------
TOPIC 32: 32_defense_aid_afghanistan_spend (30 documents)
Keywords: defense (0.067), aid (0.043), afghanistan (0.041), spend (0.039), military (0.038), gives (0.037), israel (0.035), spent (0.034)
Representative documents:
1. We spend three times more on entitlements and debt services than we do on defense.
2. We have one of the most expensive General Assemblies, per capita, in the entire country.
3. We have spent $7 trillion, trillion with a T, $7 trillion in the Middle East.
--------------------------------------------------------------------------------
TOPIC 33: 33_capitol_protesters_protesting_protest (30 documents)
Keywords: capitol (0.072), protesters (0.059), protesting (0.057), protest (0.051), riots (0.047), shows (0.046), riot (0.035), peaceful (0.035)
Representative documents:
1. The Simpsons predicted the Capitol riots
2. A photo shows two men on the run who pretend to be homeless and then "attack and rob you.
3. Protest vandalism in Richmond "all started two weeks ago when Mayor Stoney's police gassed - tear gassed - a peaceful crowd of protesters, moms and children, at 7:30 p.m. sharp.
--------------------------------------------------------------------------------
TOPIC 34: 34_jersey_new_jobs_states (29 documents)
Keywords: jersey (0.111), new (0.077), jobs (0.034), states (0.029), state (0.029), sector (0.023), growth (0.022), york (0.020)
Representative documents:
1. Because of its higher minimum wage, New York State is "already showing signs of various companies picking up and leaving.
2. Columbia, Mo., is "one of the fastest growing cities in the region, the only city in the state to grow jobs faster than the nation as a whole.
3. This decision is an unfortunate example of why those of us in Montpelier need to work together to make Vermont a more affordable place to do business and make sure our policies help businesses thrive ...
--------------------------------------------------------------------------------
TOPIC 35: 35_water_gallons_loophole_waters (29 documents)
Keywords: water (0.135), gallons (0.056), loophole (0.028), waters (0.028), drinking (0.027), saved (0.025), clean (0.025), flow (0.023)
Representative documents:
1. Onion Creek's highest flow rate" on a recent night of flooding "was 120,000 cubic feet per second, which is nearly double the average flow rate of Niagara Falls.
2. The Atlanta area's water planning district "is now the national leader in conservation.
3. The train that derailed in Ohio was carrying "over 300,000 gallons of a chemical that was banned in 1974.
--------------------------------------------------------------------------------
TOPIC 36: 36_income_pay_tax_taxes (28 documents)
Keywords: income (0.094), pay (0.077), tax (0.076), taxes (0.049), earners (0.047), percent (0.036), taxed (0.036), americans (0.031)
Representative documents:
1. The top 1 percent of income earners pay 40 percent of all state income taxes, and those at the bottom pay little or nothing.
2. We're the most highly taxed nation in the world.
3. As a U.S. senator with salary and perks, #Bernie is in the top 1 percent, but still paid only 13 percent in taxes last year.
--------------------------------------------------------------------------------
TOPIC 37: 37_marriage_equality_miss_marry (27 documents)
Keywords: marriage (0.100), equality (0.045), miss (0.040), marry (0.040), married (0.035), husband (0.033), donald (0.032), webster (0.028)
Representative documents:
1. Says you can determine a bell pepper's gender by its "bumps.
2. Kavanaugh accuser's husband breaks his silence, exposes 'sick issue' his wife has.
3. Husbands rarely beat up their wives. Single women get beaten up more.
--------------------------------------------------------------------------------
TOPIC 38: 38_romney_mitt_massachusetts_bain (27 documents)
Keywords: romney (0.187), mitt (0.167), massachusetts (0.061), bain (0.042), capital (0.031), bankruptcy (0.026), maxwell (0.025), plant (0.023)
Representative documents:
1. Says Romney and Bain Capital drove KB Toys into bankruptcy by loading it up with debt.
2. Epstein's recruiter Ghislaine Maxwell's father Robert Maxwell gave Mitt Romney his first $2M investment and started him in business.
3. Mitt Romney "supervised a company guilty of massive Medicare fraud.
--------------------------------------------------------------------------------
TOPIC 39: 39_marijuana_medical_legalizing_use (27 documents)
Keywords: marijuana (0.217), medical (0.059), legalizing (0.039), use (0.036), pot (0.035), teenager (0.029), smoking (0.029), recreational (0.029)
Representative documents:
1. Twenty-five percent of our kids in foster care are there because their parents are involved in drugs.
2. In one Colorado hospital, 50 percent of newborns tested had marijuana in their system.
3. Medical-grade marijuana alone will not get that patient 'high,' no matter what level of THC, CBD or any other compound is found in the plant.
--------------------------------------------------------------------------------
TOPIC 40: 40_road_lanes_milwaukee_built (26 documents)
Keywords: road (0.059), lanes (0.057), milwaukee (0.054), built (0.038), bridges (0.034), wi (0.033), rail (0.033), public (0.032)
Representative documents:
1. I took on the worst road system in the country, according to Trucker's magazine. When I left, they said it was the most improved road system in the country.
2. Fifteen years ago, you couldn't even get a pizza delivered in that neighborhood.
3. We don't have bridges being built" in the United States.
--------------------------------------------------------------------------------
TOPIC 41: 41_ocasio_alexandria_cortez_communism (26 documents)
Keywords: ocasio (0.053), alexandria (0.053), cortez (0.052), communism (0.040), marxist (0.040), socialists (0.040), democratic (0.039), socialist (0.038)
Representative documents:
1. Say New Jersey Reps. Donald Payne and Frank Pallone are "socialists who are openly serving in the U.S. Congress.
2. Says Alexandria Ocasio-Cortez tweeted, "Let's begin 'The Purge' to roundup all Conservative traitors.
3. Says Bernie Sanders "collaborated with Marxist regimes in the Soviet Union, Nicaragua and Cuba.
--------------------------------------------------------------------------------
TOPIC 42: 42_court_supreme_justices_election (24 documents)
Keywords: court (0.112), supreme (0.103), justices (0.056), election (0.047), confirmed (0.046), nominees (0.043), judges (0.040), bench (0.038)
Representative documents:
1. It has been 80 years since a Supreme Court vacancy was nominated and confirmed in an election year. There is a long tradition that you don't do this in an election year.
2. We've got four Supreme Court justices who … signed their name to a declaration that Americans have no fundamental right to self-defense.
3. Cheri Beasley "vacated" a man's death sentence and "threw out" the indictment in a child assault case.
--------------------------------------------------------------------------------
TOPIC 43: 43_perry_rick_texas_governor (23 documents)
Keywords: perry (0.219), rick (0.176), texas (0.040), governor (0.038), gov (0.036), binational (0.034), forum (0.030), lobbyists (0.029)
Representative documents:
1. Says Rick Perry and the Republican-controlled Texas Legislature "managed to slash the budget of the volunteer fire departments in our state by 75 percent in the last legislative session," reducing the...
2. Gov. Rick Perry "has overseen the highest Texas unemployment in 22 years.
3. Says Rick Perry is "spending more money than the state takes in, covering his deficits with record borrowing.
--------------------------------------------------------------------------------
TOPIC 44: 44_barack_obama_president_deported (23 documents)
Keywords: barack (0.077), obama (0.065), president (0.060), deported (0.059), immigration (0.056), authority (0.051), executive (0.041), deportation (0.037)
Representative documents:
1. President Barack Obama has the "power to stop deportation for all undocumented immigrants in this country.
2. President Obama said "22 times" he could not authorize immigration reform by executive order.
3. Latina who enthusiastically supported Donald Trump on stage in Las Vegas in October 2015 has been deported.
--------------------------------------------------------------------------------
TOPIC 45: 45_never_worked_obama_barack (22 documents)
Keywords: never (0.128), worked (0.059), obama (0.048), barack (0.040), encouraged (0.039), president (0.039), business (0.036), no (0.035)
Representative documents:
1. Says Barack Obama "is the first president in modern history not to have a single year of 3 percent growth.
2. There was no panhandling when I left office.
3. 92 percent" of President Barack Obama's administration has "never worked outside government.
--------------------------------------------------------------------------------
TOPIC 46: 46_drug_heroin_overdose_drugs (22 documents)
Keywords: drug (0.068), heroin (0.057), overdose (0.057), drugs (0.053), overdoses (0.047), throw (0.044), death (0.036), fentanyl (0.031)
Representative documents:
1. Says police are warning shoppers to use wipes to clean their carts over deadly risk of fentanyl residue.
2. So now they take drugs, literally, and they throw it, a hundred pounds of drugs. They throw it over the wall, they have catapults, but they throw it over the wall, and it lands and it hits somebody on...
3. 100% of heroin/fentanyl epidemic is because we don't have a WALL.
--------------------------------------------------------------------------------
TOPIC 47: 47_african_men_prison_whites (22 documents)
Keywords: african (0.113), men (0.064), prison (0.054), whites (0.052), arrested (0.047), blacks (0.047), likely (0.046), americans (0.046)
Representative documents:
1. ''Over 40 percent of African-American men in (Milwaukee County) have been going to prison for low-level drug offenses, and I think a hundred times that in the city of Madison.
2. In the 513 days between Trayvon dying, and today's verdict, 11,106 African-Americans have been murdered by other African-Americans.
3. There are more young black males involved in the criminal justice system than there are in higher education.
--------------------------------------------------------------------------------
TOPIC 48: 48_drilling_biden_offshore_pipeline (20 documents)
Keywords: drilling (0.112), biden (0.062), offshore (0.059), pipeline (0.052), oil (0.052), shale (0.045), crack (0.045), joe (0.044)
Representative documents:
1. A Costco gas pump screen said "Don't blame us. Blame Joe Biden.
2. NASA just announced a 100-foot-wide fissure-crack just opened up Yellowstone volcano in 24 hours.
3. I have always said that I would be for drilling.
--------------------------------------------------------------------------------
TOPIC 49: 49_food_stamps_stamp_farm (19 documents)
Keywords: food (0.183), stamps (0.113), stamp (0.074), farm (0.066), benefits (0.046), assistance (0.038), farmers (0.033), representatives (0.033)
Representative documents:
1. Forty-three million Americans are on food stamps.
2. Says Amazon is "paying full-time employees so little that they require gov food assistance.
3. Says "President Obama hijacked the farm bill (and) turned it into a food stamp bill.
--------------------------------------------------------------------------------
TOPIC 50: 50_bills_passed_introduced_congress (18 documents)
Keywords: bills (0.129), passed (0.053), introduced (0.048), congress (0.034), democrats (0.032), house (0.029), bill (0.026), stage (0.025)
Representative documents:
1. Says this House processed and approved as many bills as previous Legislatures.
2. Democrats have said that "even one hearing (on the stimulus bill) would be one too many, and that we have a single day to approve these five complex propositions that will affect the lives of millions...
3. Says Democrats "have issued more subpoenas than they've passed bills.
--------------------------------------------------------------------------------
TOPIC 51: 51_trade_united_china_states (18 documents)
Keywords: trade (0.064), united (0.060), china (0.055), states (0.049), car (0.048), billion (0.047), cars (0.044), tariffs (0.039)
Representative documents:
1. The United States has a massive trade deficit with Japan. It's anywhere from $69 billion to a $100 billion a year.
2. The European Union … they send us Mercedes, they send us -- by the millions -- the BMWs -- cars by the millions.
3. General Motors is sending Mexican made model of Chevy Cruze to U.S. car dealers-tax free across border.
--------------------------------------------------------------------------------
TOPIC 52: 52_texas_jobs_created_lost (17 documents)
Keywords: texas (0.103), jobs (0.092), created (0.082), lost (0.049), rest (0.040), unknown (0.036), 2009 (0.034), million (0.033)
Representative documents:
1. There's more union jobs growing in Texas, which is a right-to-work state, than in Illinois, and factory workers make more money in Texas than they do in Illinois.
2. This census is also the shortest and least intrusive count in modern history.
3. Says the Texas Forest Service planted 6,000 trees in the Dallas/Fort Worth area before the Super Bowl for unknown reasons at an unknown cost.
--------------------------------------------------------------------------------
TOPIC 53: 53_flag_american_pride_fellow (17 documents)
Keywords: flag (0.144), american (0.057), pride (0.054), fellow (0.036), an (0.035), soldier (0.031), christmas (0.030), rourke (0.030)
Representative documents:
1. McCain tells of a fellow Vietnam POW who was beaten for fashioning an American flag that became an inspirational symbol to fellow POWs.
2. The official pride flag was altered to include Ukrainian colors.
3. Not one American flag on the massive stage at the Democratic National Convention until people started complaining- then a small one. Pathetic.
--------------------------------------------------------------------------------
TOPIC 54: 54_businesses_small_minority_closing (16 documents)
Keywords: businesses (0.233), small (0.090), minority (0.076), closing (0.057), business (0.050), opening (0.040), contracts (0.039), latino (0.037)
Representative documents:
1. More businesses are closing than are opening for the first time in our nation's history.
2. Small businesses "create 70 percent of the jobs in America.
3. Small businesses (are) going out of business in record numbers.
--------------------------------------------------------------------------------
TOPIC 55: 55_amnesty_immigrants_illegal_path (16 documents)
Keywords: amnesty (0.146), immigrants (0.089), illegal (0.077), path (0.059), citizenship (0.051), dreamers (0.039), stage (0.035), hard (0.034)
Representative documents:
1. A majority of the men and women on this stage have previously and publicly embraced amnesty. I am the only candidate on this stage who has never supported amnesty.
2. Up until two weeks ago, John McCain was a leading proponent of amnesty. Now with me challenging him, suddenly he has changed.
3. The voting bill known as H.R. 1 would mean "that millions of illegal immigrants are quickly registered to vote.
--------------------------------------------------------------------------------
TOPIC 56: 56_sanctuary_cities_city_violate (15 documents)
Keywords: sanctuary (0.304), cities (0.108), city (0.095), violate (0.063), ice (0.047), illegal (0.045), mayor (0.042), harboring (0.039)
Representative documents:
1. Tom Leppert pledged to make Dallas a sanctuary city for illegal immigrants.
2. Under the mayor's COVID-19 order, Kansas City "churchgoers must register with the government.
3. Says Ohio congressional candidate Danny O'Connor "would make Columbus a sanctuary city for illegal immigrants.
--------------------------------------------------------------------------------
TOPIC 57: 57_hillary_fence_borders_wall (14 documents)
Keywords: hillary (0.078), fence (0.067), borders (0.065), wall (0.062), open (0.053), clinton (0.052), solution (0.046), want (0.042)
Representative documents:
1. Quotes Andy Borowitz as saying "stopping Hillary is a short-term solution. The long-term solution — and it will be more difficult — is fixing the educational system that has created so many people ign...
2. Says Barack Obama, Chuck Schumer and Hillary Clinton "all voted for" a border wall as senators.
3. Democrats and Joe Biden "want to abolish the suburbs altogether by ending single-family home zoning.
--------------------------------------------------------------------------------
TOPIC 58: 58_homeless_homelessness_california_highest (13 documents)
Keywords: homeless (0.115), homelessness (0.107), california (0.058), highest (0.053), population (0.049), francisco (0.044), housing (0.043), people (0.042)
Representative documents:
1. Nearly 60% of all occupants of HUD properties in U.S. are illegals.
2. Last year, Beaverton School District had the highest number of homeless students ever recorded in Oregon.
3. Homelessness has skyrocketed across California. We have the nation's highest homelessness rate and the nation's highest homeless population.
--------------------------------------------------------------------------------
TOPIC 59: 59_average_ceo_workers_worker (13 documents)
Keywords: average (0.089), ceo (0.076), workers (0.071), worker (0.069), 000 (0.064), railroad (0.052), makes (0.050), ceos (0.047)
Representative documents:
1. Say the ratio of CEO pay to average worker pay in the U.S. is 475 to 1.
2. Most of the people that work in finance make $70,000, $80,000 a year.
3. It is wrong for the average federal worker today to make more than $100,000 while the average private-sector worker makes less than $70,000.
--------------------------------------------------------------------------------
TOPIC 60: 60_carolina_north_hb2_candidates (12 documents)
Keywords: carolina (0.180), north (0.153), hb2 (0.076), candidates (0.038), ranks (0.035), south (0.034), politically (0.030), orange (0.030)
Representative documents:
1. Despite the passage of HB2, "towns, cities and counties in North Carolina are still allowed to set stricter non-discrimination policies for their own employees if they choose.
2. North Carolina ranks last in the country in K-12 funding.
3. Says North Carolina bill "would allow politically active 501(c)(4) organizations to hide major donors while using their money to support or oppose candidates and political issues.
--------------------------------------------------------------------------------
TOPIC 61: 61_dead_damar_hamlin_floyd (12 documents)
Keywords: dead (0.119), damar (0.096), hamlin (0.096), floyd (0.072), hospital (0.062), 101st (0.053), honored (0.053), tee (0.053)
Representative documents:
1. Premeditation, in murder cases like the Oscar Pistorius case, "can be formed in the twinkling of an eye.
2. Legendary actor Kirk Douglas dead, 4 days before his 101st birthday.
3. Henry Kissinger's "deathbed confession" shows "Adolf Hitler was buried in Spain" and lived to age 68.
--------------------------------------------------------------------------------
TOPIC 62: 62_stadium_game_rooting_stadiums (12 documents)
Keywords: stadium (0.138), game (0.056), rooting (0.053), stadiums (0.053), bowl (0.046), virtually (0.044), super (0.041), owned (0.039)
Representative documents:
1. Says proposal to use city-owned land for an Austin pro soccer stadium is a "$1 billion giveaway.
2. Thunderous crowd at Philadelphia Phillies game "literally registering on the Penn State University Brandywine seismograph station.
3. Protests of (Sisters of Perpetual Indulgence) at a Los Angeles Dodgers game led to a "virtually empty stadium for the game itself.
--------------------------------------------------------------------------------
TOPIC 63: 63_nfl_anthem_kneel_players (12 documents)
Keywords: nfl (0.122), anthem (0.109), kneel (0.089), players (0.079), football (0.073), game (0.067), tebow (0.063), knee (0.059)
Representative documents:
1. Derek Chauvin had his left hand in his pocket while kneeling on George Floyd.
2. A photo shows Joe Namath and Al Woodall kneeling on the sideline of a football field during the national anthem.
3. Tim Tebow used to "kneel in prayer and the NFL complained.
--------------------------------------------------------------------------------
TOPIC 64: 64_lowest_florida_per_state (12 documents)
Keywords: lowest (0.096), florida (0.077), per (0.065), state (0.064), odd (0.057), burden (0.049), pick (0.046), tennessee (0.039)
Representative documents:
1. Says Ohio is one of just 17 states with an estate tax, and it has "the lowest threshold in the nation.
2. The state constitution "clearly says there's not to be a state income tax in Tennessee.
3. We have the lowest per-capita spending of any state in the nation" except for South Dakota.
--------------------------------------------------------------------------------
TOPIC 65: 65_charlie_crist_minimum_wage (11 documents)
Keywords: charlie (0.232), crist (0.227), minimum (0.070), wage (0.068), felons (0.066), raising (0.056), taxpayer (0.051), swindled (0.045)
Representative documents:
1. Says Charlie Crist "is embroiled in a fraud case for steering taxpayer money to a de facto Ponzi scheme.
2. Two weeks after signing a taxpayer protection pledge, (Charlie Crist) breaks it.
3. Says Charlie Crist "voted against raising the minimum wage.
--------------------------------------------------------------------------------
TOPIC 66: 66_obama_hamas_reparations_elon (11 documents)
Keywords: obama (0.067), hamas (0.067), reparations (0.067), elon (0.055), musk (0.055), million (0.051), organization (0.049), fight (0.047)
Representative documents:
1. With 40 billion dollars, Elon Musk could have given each of the 330M people living in America a million dollars and still had $7B left over.
2. Says President Barack Obama's homeland security budget had "$16 million to fight climate change" but "didn't have a line item to fight violent extremism.
3. Obama used $20 million in federal money "to emmigrate (sic) Hamas Refugees to the USA.
--------------------------------------------------------------------------------
TOPIC 67: 67_formula_baby_shortage_obesity (11 documents)
Keywords: formula (0.195), baby (0.144), shortage (0.134), obesity (0.092), current (0.069), kids (0.063), kills (0.063), homemade (0.063)
Representative documents:
1. The United States can't import baby formula because of the U.S.-Mexico-Canada Agreement.
2. Babysitter on crystal meth eats 3-month-old toddler.
3. Donations of baby formula to Ukraine helped cause the current U.S. shortage.
--------------------------------------------------------------------------------
TOPIC 68: 68_fund_rainy_budget_budgets (11 documents)
Keywords: fund (0.125), rainy (0.102), budget (0.072), budgets (0.065), governor (0.051), day (0.050), state (0.042), continual (0.042)
Representative documents:
1. Says John Kitzhaber's proposed budget for upcoming biennium represents 8 percent increase over last
2. As state Commerce secretary, Mary Burke drafted, sponsored and promoted "budgets that raised taxes by billions, created high structural deficits and raided funds to finance huge spending hikes.
3. The rainy day fund … is $320 million. That's much more than what it has been in the past.
--------------------------------------------------------------------------------
TOPIC 69: 69_gender_old_consent_boy (11 documents)
Keywords: gender (0.087), old (0.058), consent (0.057), boy (0.057), school (0.055), trans (0.055), girl (0.050), age (0.044)
Representative documents:
1. U.N. calls for decriminalizing sex with minors.
2. Schools in Austin Independent School District are teaching children as young as 4 years old that their gender is fluid – that they're not a boy or a girl and they can choose whichever gender they want...
3. At a Wisconsin school, a 12-year-old girl "was transitioned into a boy by school officials without parental consent," and Protasiewicz supports it.
--------------------------------------------------------------------------------
TOPIC 70: 70_virginia_west_virginians_moving (11 documents)
Keywords: virginia (0.174), west (0.131), virginians (0.059), moving (0.043), average (0.040), 2017 (0.040), pinnacle (0.035), fun (0.035)
Representative documents:
1. Over the last few years, more Virginians are moving away from Virginia than are moving to Virginia from the other 49 states.
2. West Virginia is the state with the oldest population.
3. In 2017, West Virginia's $43,469 median household income was $16,867 below the national average, ranking 50th.
--------------------------------------------------------------------------------
TOPIC 71: 71_jefferson_thomas_written_government (10 documents)
Keywords: jefferson (0.067), thomas (0.062), written (0.061), government (0.058), restrain (0.054), constitution (0.054), independence (0.051), said (0.047)
Representative documents:
1. Quotes Thomas Jefferson as saying that "a government big enough to give you everything you want, is strong enough to take everything you have.
2. Thomas Jefferson said, "That government is best which governs the least, because its people discipline themselves.
3. Says George Washington said, "A free people ought not only be armed and disciplined. But they should have sufficient arms and ammunition to maintain a status of independence from any who might attempt...
--------------------------------------------------------------------------------
TOPIC -1 (1894 documents)
OUTLIER TOPIC: Documents that don't fit well into any topic
--------------------------------------------------------------------------------
These results allow us to confirm the conclusions from the topic frequency analysis.
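For reference, a report in the format above can be produced with a loop along the following lines. This is a minimal sketch, assuming a fitted BERTopic model named topic_model; the slicing choices (8 keywords, 3 documents, 200-character truncation) are ours and simply mimic the printed output.

# Sketch: print a per-topic report from a fitted BERTopic model
info = topic_model.get_topic_info()
for _, row in info.iterrows():
    topic_id = row["Topic"]
    print(f"TOPIC {topic_id}: {row['Name']} ({row['Count']} documents)")
    if topic_id == -1:
        print("OUTLIER TOPIC: Documents that don't fit well into any topic")
    else:
        keywords = ", ".join(f"{word} ({score:.3f})"
                             for word, score in topic_model.get_topic(topic_id)[:8])
        print(f"Keywords: {keywords}")
        print("Representative documents:")
        for i, doc in enumerate(topic_model.get_representative_docs(topic_id)[:3], 1):
            print(f"{i}. {doc[:200]}")
    print("-" * 80)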
A look at topic distinctiveness
While frequency tells you how prevalent a topic is in your dataset, distinctiveness (also called semantic purity or coherence) tells you how conceptually focused that topic is.
- A frequent topic appears often — but could be vague or messy (e.g., generic words like “said” or “people”).
- A distinctive topic has words that are strongly associated with it and not shared with other topics — making it easier to interpret and label.
🧰 How BERTopic Measures Distinctiveness: c-TF-IDF
To quantify this, BERTopic uses class-based TF-IDF (c-TF-IDF) — a clever twist on standard TF-IDF.
Instead of calculating term frequency per document, c-TF-IDF calculates term frequency per topic.
This helps emphasize words that are especially characteristic of each topic, even if they’re not globally rare.
🔍 Example
Let’s say:
- Topic A includes 200 statements about healthcare.
- Topic B includes 150 statements about education.
Suppose the word “medicare” appears 120 times in Topic A but barely shows up in Topic B or any other topic.
In c-TF-IDF, “medicare” gets a high score for Topic A because:
- It’s common within Topic A (high term frequency),
- But uncommon in other topics (high inverse topic frequency).
Compare that to a word like “people”: even if “people” appears often, it shows up across all topics, so it gets a low c-TF-IDF score — it’s not distinctive.
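To make this concrete, here is a tiny NumPy sketch of the c-TF-IDF weighting on made-up counts. The two-topic count matrix and term names are invented for illustration, and BERTopic's own implementation has additional options and normalisations, but the core idea is the same.

import numpy as np

# Made-up topic-term counts: rows = topics, columns = terms
# Terms:                    medicare  people  tuition
counts = np.array([
    [120.0, 90.0,  2.0],   # Topic A (healthcare)
    [  3.0, 85.0, 70.0],   # Topic B (education)
])

# Term frequency per topic (class), not per document
tf = counts / counts.sum(axis=1, keepdims=True)

# Inverse topic frequency: log(1 + A / f_t), with A = average number of
# words per topic and f_t = total frequency of term t across all topics
A = counts.sum(axis=1).mean()
f_t = counts.sum(axis=0)
ctfidf = tf * np.log(1 + A / f_t)

print(np.round(ctfidf, 3))
# "medicare" dominates Topic A's row, while "people", which is spread
# across both topics, is down-weighted relative to its raw frequency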
🎯 Why this matters
By using c-TF-IDF, BERTopic highlights the most topic-specific keywords, which helps:
- Identify clear, well-separated topics,
- Surface coherent themes that are easier to label,
- Filter out blurry, overlapping ones.
So when evaluating topics, don’t just consider how many documents belong to them — look at how sharply defined they are, too.
# Get the c-TF-IDF matrix (topics × vocabulary)
ctfidf = topic_model.c_tf_idf_

# Get vocabulary terms
vectorizer = topic_model.vectorizer_model
vocab = vectorizer.get_feature_names_out()

# Calculate average c-TF-IDF score per topic (excluding outliers)
topic_ids = topic_model.get_topic_info().Topic.tolist()
topic_ids = [t for t in topic_ids if t != -1]

avg_scores = {}
for topic_id in topic_ids:
    scores = ctfidf[topic_id].toarray().flatten()
    avg_scores[topic_id] = scores.mean()

# Sort topics by average c-TF-IDF score (descending)
distinctive_topics = sorted(avg_scores.items(), key=lambda x: x[1], reverse=True)

print("Most distinctive topics (by average c-TF-IDF):")
for topic_id, score in distinctive_topics[:10]:
    print(f"Topic {topic_id}: avg c-TF-IDF score = {score:.4f}")
Most distinctive topics (by average c-TF-IDF):
Topic 62: avg c-TF-IDF score = 0.0004
Topic 17: avg c-TF-IDF score = 0.0004
Topic 14: avg c-TF-IDF score = 0.0004
Topic 68: avg c-TF-IDF score = 0.0004
Topic 30: avg c-TF-IDF score = 0.0004
Topic 38: avg c-TF-IDF score = 0.0003
Topic 64: avg c-TF-IDF score = 0.0003
Topic 34: avg c-TF-IDF score = 0.0003
Topic 18: avg c-TF-IDF score = 0.0003
Topic 57: avg c-TF-IDF score = 0.0003
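One caveat worth flagging: when the model keeps an outlier topic (-1), the rows of c_tf_idf_ typically include it as well, usually with rows following the sorted topic ids, so indexing the matrix with a raw topic id can be off by one. A more defensive variant of the averaging step, written under that row-ordering assumption (worth verifying against your BERTopic version):

# Map each topic id to its row in c_tf_idf_, assuming rows follow the
# sorted topic ids (so topic -1, if present, is row 0) — an assumption
# about BERTopic internals, not a documented guarantee
sorted_topics = sorted(topic_model.get_topic_info().Topic.tolist())
row_of = {t: i for i, t in enumerate(sorted_topics)}

avg_scores = {t: ctfidf[row_of[t]].toarray().flatten().mean()
              for t in sorted_topics if t != -1}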
for topic_id, _ in distinctive_topics[:10]:
    print(f"\nTopic {topic_id} top words:")
    print(topic_model.get_topic(topic_id))
Topic 62 top words:
[('stadium', 0.13795182106578005), ('game', 0.05619589128928586), ('rooting', 0.05327750905772074), ('stadiums', 0.05327750905772074), ('bowl', 0.04581549316615795), ('virtually', 0.04433311469056421), ('super', 0.04104105166066736), ('owned', 0.0394148060187943), ('built', 0.03806296441850709), ('public', 0.03188214531714473)]

Topic 17 top words:
[('pelosi', 0.10193614680765169), ('nancy', 0.08614272950243061), ('kamala', 0.047539981628650554), ('harris', 0.04700065055706048), ('schiff', 0.039357629122400754), ('resign', 0.03700831367816971), ('adam', 0.03218324769956764), ('video', 0.03207188913138721), ('impeachment', 0.0288616172039607), ('house', 0.026639451129318155)]

Topic 14 top words:
[('taxes', 0.08554675464390696), ('tax', 0.06657037990590066), ('kaine', 0.051610650943962225), ('tim', 0.04894487845317714), ('increase', 0.04311120112099205), ('raise', 0.042212501062001036), ('raised', 0.02947163743405533), ('voted', 0.02650671151725443), ('your', 0.022968079278039084), ('senate', 0.021749891167399338)]

Topic 68 top words:
[('fund', 0.12498161978974934), ('rainy', 0.10176995471620823), ('budget', 0.07159927989262056), ('budgets', 0.06526044084714358), ('governor', 0.050669157668142714), ('day', 0.05046184333916669), ('state', 0.0423656523909458), ('continual', 0.041970405276855), ('320', 0.041970405276855), ('drafted', 0.041970405276855)]

Topic 30 top words:
[('oil', 0.12457196219360095), ('gallon', 0.047724472048081196), ('gas', 0.04685677581127021), ('pump', 0.03590058184018368), ('cents', 0.035620178408996735), ('prices', 0.03420087638388947), ('energy', 0.032642306204577504), ('russia', 0.03210570051464789), ('price', 0.03121993960515881), ('barrel', 0.029769442536033697)]

Topic 38 top words:
[('romney', 0.18653074546391543), ('mitt', 0.16723359308182045), ('massachusetts', 0.06105821994399273), ('bain', 0.04192987361682082), ('capital', 0.030529109971996364), ('bankruptcy', 0.02562738772992853), ('maxwell', 0.02479820342280418), ('plant', 0.023491247792721437), ('first', 0.020896770370246082), ('wants', 0.020590686651728807)]

Topic 64 top words:
[('lowest', 0.09579545718244353), ('florida', 0.07651895235795685), ('per', 0.06506454923414083), ('state', 0.06392010711616385), ('odd', 0.0572499549085157), ('burden', 0.049265060639073225), ('pick', 0.04629285553076316), ('tennessee', 0.0390986847009778), ('capita', 0.0390986847009778), ('tax', 0.03875309197073962)]

Topic 34 top words:
[('jersey', 0.11096255370054298), ('new', 0.07722761456544958), ('jobs', 0.03390893474854867), ('states', 0.02922026111759179), ('state', 0.028925068319436587), ('sector', 0.023057284766550175), ('growth', 0.021879733947925584), ('york', 0.020410105899409905), ('private', 0.0193397284259531), ('job', 0.018041011257105694)]

Topic 18 top words:
[('clinton', 0.1041847440016645), ('hillary', 0.09732329944960937), ('she', 0.04220664324576848), ('emails', 0.03760946363984845), ('fbi', 0.03275851517384305), ('her', 0.0324574756138043), ('email', 0.029237592966797787), ('laughing', 0.024244365943473428), ('department', 0.022509518642157698), ('trump', 0.021424032410235324)]

Topic 57 top words:
[('hillary', 0.07785247790730068), ('fence', 0.06704699443943352), ('borders', 0.06515290778403704), ('wall', 0.06165555222700382), ('open', 0.05293297966047305), ('clinton', 0.0515921708979436), ('solution', 0.04619257541443909), ('want', 0.04174767169650181), ('wants', 0.03711407717472106), ('border', 0.03711407717472106)]
BERTopic helps us surface some really distinctive topics (here we look at the top 10):
Topic 10 – National Debt & Fiscal Policy
Top words: debt, trillion, national, deficit, spending, budget
Interpretation: This cluster seems focused on concerns about the U.S. national debt, budget deficits, and government spending limits like the debt ceiling.

Topic 13 – Obama, Foreign Policy & Military
Top words: obama, syria, troops, iraq, isis
Interpretation: Highly centered on Barack Obama, especially in the context of foreign policy, military intervention, and possibly conspiracy-related claims (e.g., “kenya”).

Topic 64 – Stadiums & Sports Economics
Top words: stadium, game, bowl, owned, super
Interpretation: Talks about sports infrastructure, maybe public funding of stadiums or major events like the Super Bowl.

Topic 24 – Texas Cities & Local Governance
Top words: austin, texas, council, residents
Interpretation: Likely about local issues in Texas cities, especially Austin — maybe urban policy, city council decisions, or local disputes.

Topic 21 – Prominent Democratic Figures (Pelosi, Harris)
Top words: pelosi, nancy, kamala, resign, video
Interpretation: Focuses on U.S. political figures — possibly criticism, controversy, or viral content about Pelosi and Harris.

Topic 58 – U.S. Constitution & Founding Values
Top words: amendment, speech, jefferson, written
Interpretation: Philosophical or historical references to the Constitution, possibly used in political argumentation.

Topic 22 – Climate Change & Global Warming
Top words: climate, warming, ice, hoax, change
Interpretation: Clear climate-related content — including denialist terms (hoax) that suggest politicized debate.

Topic 73 – Homelessness in California
Top words: homeless, francisco, california, housing
Interpretation: Focused on homelessness issues, probably with a strong emphasis on California or San Francisco.

Topic 71 – Amanda Fritz & Odd Claims
Top words: amanda, fritz, endorsed, invention, mercury
Interpretation: Very specific — may relate to Portland City Commissioner Amanda Fritz, and seems to include fringe or quirky claims (e.g., “invention”, “mercury”).

Topic 28 – Income & Taxation
Top words: income, tax, earners, rate, americans
Interpretation: Economic fairness, taxation, and income brackets — probably debates on who pays what and whether the tax system is fair.
✨ Summary
These are some crystal-clear, high-purity topics, great for:
- Labeling (we can assign them meaningful names)
- Segmenting by theme
- Spotting highly specific political narratives
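As a quick illustration of the labeling step, BERTopic lets you attach custom names to topics. The label text below is our own suggestion based on the top words seen above, not part of the lab solution.

# Attach human-readable labels to a few of the distinctive topics
# (label text is illustrative only)
topic_model.set_topic_labels({
    62: "Stadiums & Sports Economics",
    17: "Pelosi, Harris & Impeachment",
    18: "Clinton Emails & FBI",
})

# The custom labels should then show up alongside the default topic names
topic_model.get_topic_info().head()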
Footnotes
in reference to the North Carolina bill passed in 2016 that required people to use bathrooms corresponding to their birth gender.↩︎