✅ Reading Week Homework: Solutions

2025/26 Winter Term

Author

The DS202 Team

Materials to download

This solution file follows the format of the Jupyter Notebook file .ipynb you had to fill in for your homework. If you want to render the document yourselves and play with the code, you can download the .ipynb version of this solution file by clicking on the button below:

⚙️ Setup

# Prefer conda-forge for dependency resolution; use pip only if unavailable.
#
# conda install -c conda-forge pandas scikit-learn matplotlib seaborn xgboost plotly shap
#
# pip fallback (only if needed):
# !pip install shap

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
import shap

from joblib import Parallel, delayed
from scipy.stats import loguniform
import plotly.graph_objects as go

from sklearn.model_selection import (
    train_test_split, StratifiedKFold,
    cross_validate, GridSearchCV, RandomizedSearchCV
)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import f1_score, precision_score, recall_score, make_scorer
from xgboost import XGBClassifier

sns.set_theme(style='whitegrid', palette='muted')
RANDOM_STATE = 123

🥅 Homework objectives

We have introduced a range of supervised learning algorithms in previous labs and lectures. Selecting the best model is rarely straightforward in practice. In this assignment, we guide you through that process using a real labour economics dataset.

Compare the decision boundaries produced by four models trained to predict whether a US worker earns above the median wage.
Apply 5-fold cross-validation to determine which model performs best overall.
Fine-tune a Support Vector Machine and an XGBoost model.
Use SHAP values to understand what drives predictions in the best model.
Use multinomial logistic regression to predict wage tercile membership and interpret its coefficients in the context of the gender wage gap.

Loading the CPS-MORG 2019 dataset

We use the 2019 Current Population Survey Merged Outgoing Rotation Group (CPS-MORG), the benchmark dataset in the US labour economics literature on wage inequality and the gender pay gap (Blau & Kahn 2017 (Blau and Kahn 2017), among many others). Freely available from the NBER — US Government public domain data.

Each row is a full-time wage/salary worker aged 18–64 surveyed in 2019.

Variable	Description
`hourly_wage`	Computed hourly wage (weekly earnings ÷ hours, or direct rate if paid hourly)
`above_median_wage`	Binary outcome: 1 if wage > sample median
`wage_tercile`	Three-class outcome: `bottom`, `middle`, `top`
`female`	1 = female, 0 = male
`educ_years`	Years of schooling (recoded from CPS categories)
`potential_exp`	age − educ_years − 6 (Mincer convention)
`age`	Age in years
`married`	1 = married with spouse present
`union_mem`	1 = union member or covered by contract
`public_sector`	1 = government industry
`occ_group`	`management`, `professional`, `service`, `sales_office`, `blue_collar`
`white`, `black`, `hispanic`	Race/ethnicity indicators

Note

📋 Data preparation note

The CSV was constructed from the NBER CPS MORG file (morg19.dta) using the following steps:

Sample restrictions: full-time workers (≥ 35 hours/week), age 18–64, with valid earnings.

Hourly wage:

If paid hourly, use earnhre.

Otherwise compute earnwke ÷ uhourse.

Education: educ_years recoded from grade92 to represent completed years of schooling.

Potential experience:
\[ \text{potential\_exp} = \text{age} - \text{educ\_years} - 6 \] This follows the Mincer (1974) earnings framework (Mincer 1974), which models wages as a function of schooling and labour market experience.

Union membership: union_mem = unionmme OR unioncov.

Public sector indicator: ind02 ∈ [8770, 9590].

Occupation group: derived from occ2012.

You are not required to reproduce these transformations, but you should understand what each constructed variable represents and why it may matter economically.

df = pd.read_csv('data/cps_morg_2019_wages.csv')
print(df.shape)
df.head()

(113654, 14)

	hourly_wage	above_median_wage	wage_tercile	female	educ_years	potential_exp	age	married	public_sector	occ_group	white	black
0	20.91325	0	middle	1	14.0	34.0	54	0	0	sales_office	0	1
1	28.84600	1	middle	1	16.0	35.0	57	0	1	professional	0	1
2	12.25000	0	bottom	1	14.0	3.0	23	0	0	service	0	1
3	200.00000	1	top	1	12.0	6.0	24	0	0	service	0	1
4	56.77400	1	top	0	16.0	15.0	37	1	0	professional	1	0

Before fitting any models, explore the data.

❓ Question 1: Compute the mean hourly wage for men and women separately. What is the raw (unadjusted) hourly wage gap? Does the size surprise you?

gap = df.groupby('female')['hourly_wage'].mean().rename({0: 'Male', 1: 'Female'})
print(gap)
print(f"\nRaw gap: ${gap['Male'] - gap['Female']:.2f}/hr  "
      f"({(gap['Male'] - gap['Female']) / gap['Male']:.1%} of male wage)")

female
Male      29.629973
Female    26.657336
Name: hourly_wage, dtype: float64

Raw gap: $2.97/hr  (10.0% of male wage)

Answer Q1

The raw gap among full-time workers is around $3/hr (~10%), meaning women earn roughly 90 cents for every dollar earned by men. This is narrower than the BLS headline (~82 cents) because we restrict to full-time workers — part-time employment, which is disproportionately female and lower-paid, is excluded. The gap is sensitive to sample composition before a single model has been fit.

Feature setup

occ_group is categorical and must be one-hot encoded before passing to scikit-learn.

NUMERIC_FEATURES = [
    'female', 'educ_years', 'potential_exp', 'age',
    'married', 'union_mem', 'public_sector',
    'white', 'black', 'hispanic'
]
TARGET_BINARY = 'above_median_wage'

df_enc = pd.get_dummies(df, columns=['occ_group'], drop_first=True)
OCC_DUMMIES = [c for c in df_enc.columns if c.startswith('occ_group_')]
ALL_FEATURES = NUMERIC_FEATURES + OCC_DUMMIES

X = df_enc[ALL_FEATURES]
y = df_enc[TARGET_BINARY]
print(f'Features ({len(ALL_FEATURES)}): {ALL_FEATURES}')

Features (14): ['female', 'educ_years', 'potential_exp', 'age', 'married', 'union_mem', 'public_sector', 'white', 'black', 'hispanic', 'occ_group_management', 'occ_group_professional', 'occ_group_sales_office', 'occ_group_service']

Inspecting class balance (binary outcome)

Before splitting the data, inspect the distribution of the binary outcome.

❓ Are the two classes balanced? What might this imply for how you evaluate model performance?

### Answer

Because above_median_wage is defined using the sample median, the split is approximately 50/50 (up to ties at the median). In this setting, metrics like accuracy and ROC AUC are not mechanically distorted by severe imbalance. That said, it is still good practice to check class balance: if one class were rare, you would place more emphasis on class-sensitive metrics (e.g. precision/recall or PR AUC) and on how you choose a classification threshold.

counts = y.value_counts()
props  = y.value_counts(normalize=True)
display(pd.DataFrame({'count': counts, 'share': props.round(3)}))

	count	share
above_median_wage
0	56827	0.5
1	56827	0.5

Tip

Do not assume balance: verify it.

If one class were much rarer than the other, would all metrics still be informative?

Train / test split

Perform a stratified 75/25 split. Stratifying on y preserves class balance in both sets.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=RANDOM_STATE
)
print(f'Train: {X_train.shape[0]:,}  |  Test: {X_test.shape[0]:,}')

Train: 85,240  |  Test: 28,414

Stratified subsample for SVM steps

SVMs scale as O(n²)–O(n³) in training samples. We draw a 5,000-row stratified subsample — preserving class balance — which keeps wall time under a minute while producing stable cross-validated estimates at this feature dimensionality.

X_train_sub, _, y_train_sub, _ = train_test_split(
    X_train, y_train,
    train_size=5_000,
    stratify=y_train,
    random_state=RANDOM_STATE
)
print(f'Full training set : {len(X_train):,} rows  (balance: {y_train.mean():.3f})')
print(f'SVM subsample     : {len(X_train_sub):,} rows  (balance: {y_train_sub.mean():.3f})')

Full training set : 85,240 rows  (balance: 0.500)
SVM subsample     : 5,000 rows  (balance: 0.500)

The two balance figures should be nearly identical — confirmation that stratification worked.

5-fold cross-validation setup

We will evaluate models using 5-fold cross-validation.

This means:

Split the training data into 5 equally sized folds.
Train on 4 folds.
Evaluate on the remaining fold.
Repeat so each fold serves as validation once.

For classification problems, it is important that each fold preserves the class proportions of the outcome.

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

Evaluation metrics

We will evaluate models using multiple metrics.

Think about what each metric measures:

Does it depend on a probability threshold?
Does it treat false positives and false negatives symmetrically?
Does it assume balanced classes?

You should construct a dictionary mapping metric names to scorer objects.

scoring = {
    'precision': make_scorer(precision_score, zero_division=0),
    'recall':    make_scorer(recall_score),
    'f1':        make_scorer(f1_score),
}

🗺️ Decision boundaries

Each model is trained on all features in ALL_FEATURES.

To visualise its behaviour, we take a 2D slice of the high-dimensional feature space:

Vary potential_exp (x-axis)
Vary educ_years (y-axis)
Fix all other features at their training-set median

This produces a cross-section of a 14-dimensional decision surface.

We then:

Generate a dense grid of points in this 2D plane.
Ask the model to predict at each grid point.
Plot the predicted class regions.

❓ Which models produce linear boundaries?
❓ Which allow nonlinear curvature?
❓ How does flexibility differ across models?

Initialise the five models

We compare five classifiers spanning a range of boundary shapes:

Logistic regression — linear boundary, interpretable coefficients.
Decision tree — axis-aligned rectangular regions, fully interpretable but prone to overfitting.
Linear SVM — margin-maximising linear boundary.
RBF SVM — radial basis function kernel; the most widely used SVM kernel in practice. Boundary shape is controlled by C (margin width) and gamma (kernel bandwidth).
Polynomial SVM (degree 2) — quadratic boundary. Less common than RBF but useful to contrast: polynomial kernels expand the feature space explicitly, RBF kernels do so implicitly via an infinite-dimensional mapping.

models = {
    'Logistic\nRegression': LogisticRegression(max_iter=1000, random_state=RANDOM_STATE),
    'Decision\nTree':       DecisionTreeClassifier(random_state=RANDOM_STATE),
    'SVM\nLinear':          Pipeline([
        ('scaler', StandardScaler()),
        ('svm',    SVC(kernel='linear', random_state=RANDOM_STATE))
    ]),
    'SVM\nRBF':             Pipeline([
        ('scaler', StandardScaler()),
        ('svm',    SVC(kernel='rbf', random_state=RANDOM_STATE))
    ]),
    'SVM\nPolynomial':      Pipeline([
        ('scaler', StandardScaler()),
        ('svm',    SVC(kernel='poly', degree=2, random_state=RANDOM_STATE))
    ]),
}

Why are only the SVMs wrapped in a Pipeline?

Support Vector Machines are not scale-invariant.

The SVM margin depends on distances in feature space.
The RBF and polynomial kernels compute functions of inner products and Euclidean distances.
If one feature has a much larger numerical range than others, it can dominate the geometry of the margin.

For this reason, each SVM is wrapped in a Pipeline with a StandardScaler, which: 1. Fits the scaler only on the training data (avoiding leakage), 2. Standardises all features to zero mean and unit variance, 3. Then fits the SVM on the scaled features.

Tree-based models (e.g. decision trees) do not require scaling: - They split on feature thresholds. - They are invariant to monotonic transformations of individual features.

Logistic regression is technically scale-sensitive, but: - Scaling is not strictly required for correctness, - It mainly affects optimisation speed and coefficient comparability, - Here we prioritise interpretability of raw-scale coefficients.

Using a Pipeline ensures preprocessing is consistently applied inside cross-validation and hyperparameter tuning.

Build the 2D visualisation grid

We want to visualise how predicted probabilities vary across two key features: - potential_exp (x-axis) - educ_years (y-axis)

We therefore construct two objects:

1️⃣ `obs_grid`

obs_grid summarises the observed data: - For each integer combination of (potential_exp, educ_years) - Compute the observed proportion of above-median earners

This will later be used to colour observed data points.

2️⃣ `vis_grid`

vis_grid is a dense prediction grid used for model visualisation.

Steps:

Create evenly spaced sequences for potential_exp and educ_years using np.linspace.
Use np.meshgrid to create all coordinate pairs.
Construct a DataFrame where:
- Every row corresponds to one grid point.
- All other features are fixed at their training-set median.
- Only potential_exp and educ_years vary across rows.

❓ Why do we use continuous sequences rather than only observed values?
❓ What does fixing other features at their median imply about this visualisation?

obs_grid = (
    df_enc
    .assign(potential_exp=df_enc['potential_exp'].astype(int))
    .groupby(['potential_exp', 'educ_years'])[TARGET_BINARY]
    .mean().reset_index()
    .rename(columns={TARGET_BINARY: 'prop_above_median'})
)

medians  = X_train.median()
N_GRID   = 80
exp_lin  = np.linspace(df['potential_exp'].min(), df['potential_exp'].max(), N_GRID)
educ_lin = np.linspace(df['educ_years'].min(),   df['educ_years'].max(),   N_GRID)
exp_mesh, educ_mesh = np.meshgrid(exp_lin, educ_lin)

vis_grid = pd.DataFrame(
    np.tile(medians.values, (exp_mesh.size, 1)),
    columns=ALL_FEATURES
)
vis_grid['potential_exp'] = exp_mesh.ravel()
vis_grid['educ_years']    = educ_mesh.ravel()
print(f'Grid: {N_GRID}×{N_GRID} = {len(vis_grid):,} points')

Grid: 80×80 = 6,400 points

What did we just build?

We constructed two different objects, serving different purposes.

1️⃣ `obs_grid` — a summary of the observed data

We convert potential_exp to integers so we can group by discrete experience levels.
For each (potential_exp, educ_years) combination, we compute the observed proportion of above-median earners.
This is purely descriptive — it summarises the empirical data.

It tells us: in the actual sample, what fraction of workers in each cell earn above the median.

2️⃣ `vis_grid` — a synthetic prediction grid

This is not real data.

We:

Take the median feature vector from the training data.
Replicate it $N \times N$ times.
Overwrite only two columns (potential_exp, educ_years) using a dense meshgrid.

This produces a smooth grid of points covering the 2D plane.

When we pass vis_grid to a model’s predict or predict_proba method, we obtain:

A 2D cross-section of the model’s high-dimensional decision surface.

Why fix all other features at their median?

Because the models were trained in 14 dimensions.

To visualise them in 2D, we must hold the remaining 12 features constant.
Using the median gives a “typical worker” profile.

This means:

The boundary you see is conditional on all other features being fixed.
Changing those fixed values would shift the boundary.

This is a slice of the model — not the full decision surface.

Fit models and generate predictions

joblib.Parallel runs independent tasks concurrently across CPU cores. Since each model fit is independent of the others, we can parallelise across them. The pattern is always:

def my_task(arg):               # define the work as a plain function
    return do_something(arg)

results = Parallel(n_jobs=-1)(  # n_jobs=-1 uses all available cores
    delayed(my_task)(x)         # delayed() wraps each call lazily
    for x in my_list
)

Write a fit_one helper that fits one model and returns (name, {y_pred, grid_pred, f1}), then dispatch it with Parallel.

def fit_one(name, model, X_tr, y_tr, X_te, y_te, X_grid):
    model.fit(X_tr, y_tr)
    y_pred    = model.predict(X_te)
    grid_pred = model.predict(X_grid)
    f1        = f1_score(y_te, y_pred)
    return name, {'y_pred': y_pred, 'grid_pred': grid_pred, 'f1': f1}

fitted = Parallel(n_jobs=-1)(
    delayed(fit_one)(name, model,
                     X_train_sub, y_train_sub,
                     X_test, y_test,
                     vis_grid[ALL_FEATURES])
    for name, model in models.items()
)
results = dict(fitted)

for name, res in results.items():
    print(f"{name.replace(chr(10), ' '):<25}  F1 = {res['f1']:.4f}")

Logistic Regression        F1 = 0.7154
Decision Tree              F1 = 0.6225
SVM Linear                 F1 = 0.7093
SVM RBF                    F1 = 0.7176
SVM Polynomial             F1 = 0.6713

Visualising decision boundaries

We now compare how different classifiers partition the feature space.

Using two selected features:

Fit each model on the training data.
Generate predictions over a fine grid spanning the feature space.
Plot the predicted class regions.
Overlay the training observations.

❓ How do the boundaries differ across models?
❓ Which models produce linear separations? Which allow nonlinear structure?

fig, axes = plt.subplots(2, 3, figsize=(18, 11))
cmap_bg  = plt.cm.colors.ListedColormap(['#d6eaf8', '#1a5276'])
cmap_pts = plt.cm.RdYlGn

for ax, (name, res) in zip(axes.flatten(), results.items()):
    Z = res['grid_pred'].reshape(N_GRID, N_GRID).astype(float)
    ax.pcolormesh(exp_lin, educ_lin, Z, cmap=cmap_bg, alpha=0.4, shading='auto')
    sc = ax.scatter(
        obs_grid['potential_exp'], obs_grid['educ_years'],
        c=obs_grid['prop_above_median'], cmap=cmap_pts,
        s=20, alpha=0.85, zorder=2, vmin=0, vmax=1
    )
    ax.set_xlabel('Potential experience (years)', fontsize=10)
    ax.set_ylabel('Education (years)', fontsize=10)
    ax.set_title(f"{name}\nF1 = {res['f1']:.3f}\n(other features at median)", fontsize=10)
    ax.set_xlim(exp_lin[0], exp_lin[-1])
    ax.set_ylim(educ_lin[0], educ_lin[-1])

# Hide the unused sixth panel
axes.flatten()[-1].set_visible(False)

cbar_ax = fig.add_axes([0.92, 0.15, 0.012, 0.7])
fig.colorbar(sc, cax=cbar_ax, label='Observed prop. above median wage')
p0 = mpatches.Patch(color='#d6eaf8', alpha=0.7, label='Predicted: below median')
p1 = mpatches.Patch(color='#1a5276', alpha=0.7, label='Predicted: above median')
fig.legend(handles=[p0, p1], loc='lower center', ncol=2,
           bbox_to_anchor=(0.45, 0.01), fontsize=10)
fig.suptitle('Decision boundaries — above median wage (CPS-MORG 2019)', fontsize=13)
fig.subplots_adjust(right=0.90, hspace=0.45, wspace=0.3, bottom=0.08)
plt.show()

❓ Question 2: Which model achieves the highest F1? Which produces the most interpretable boundary shape, and why is that shape more or less useful for understanding this dataset?

Answer Q2

The best-performing model by F1 is the RBF SVM (0.718), followed extremely closely by logistic regression (0.715) and the linear SVM (0.709). The differences among these three are small.

The decision tree performs substantially worse (0.623), reflecting its high variance and axis-aligned boundary structure.

The key insight is structural:

The wage–education–experience relationship is approximately log-linear (the Mincer earnings framework).
Linear models already capture most of the signal.
The RBF SVM extracts a small additional gain by allowing mild nonlinearity.
The polynomial SVM (degree 2) underperforms relative to linear models, suggesting no strong quadratic structure in this 2D slice.

Importantly, the performance gap between RBF and logistic regression is only ~0.002 — practically negligible. Model choice here is unlikely to materially change decisions.

📊 5-fold cross-validation

Apply the same Parallel pattern to run cross_validate for all four models. Set n_jobs=1 inside cross_validate — the outer Parallel already uses all cores, so allowing each worker to also spawn its own thread pool would over-subscribe the CPU.

def cv_one(name, model, X, y, cv, scoring):
    return name, cross_validate(model, X, y, cv=cv, scoring=scoring, n_jobs=1)

cv_results = dict(
    Parallel(n_jobs=-1)(
        delayed(cv_one)(name, model, X_train_sub, y_train_sub, cv, scoring)
        for name, model in models.items()
    )
)

Build a tidy long-format DataFrame and plot mean ± 95% CI with Plotly.

rows = [
    {'model': mname.replace('\n', ' '), 'metric': metric,
     'fold': i + 1, 'score': score}
    for mname, res in cv_results.items()
    for metric in ['precision', 'recall', 'f1']
    for i, score in enumerate(res[f'test_{metric}'])
]
long_df  = pd.DataFrame(rows)
summary  = (long_df.groupby(['model', 'metric'])['score']
            .agg(['mean', 'std']).reset_index())
summary['ci95'] = 1.96 * summary['std'] / np.sqrt(5)
rank_order = (summary[summary['metric'] == 'f1']
              .sort_values('mean', ascending=False)['model'].tolist())
summary['rank'] = summary['model'].apply(lambda m: rank_order.index(m))

palette    = {'f1': '#4c72b0', 'precision': '#dd8452', 'recall': '#55a868'}
metric_lab = {'f1': 'F1', 'precision': 'Precision', 'recall': 'Recall'}
offsets    = {'f1': -0.15, 'precision': 0.0, 'recall': 0.15}

fig   = go.Figure()
shown = set()
for metric in ['f1', 'precision', 'recall']:
    sub   = summary[summary['metric'] == metric].set_index('model').loc[rank_order].reset_index()
    x_pos = [sub.iloc[j]['rank'] + offsets[metric] for j in range(len(sub))]
    fig.add_trace(go.Scatter(
        x=x_pos, y=sub['mean'],
        error_y=dict(type='data', array=sub['ci95'].tolist(), visible=True,
                     color=palette[metric], thickness=1.5, width=6),
        mode='markers',
        marker=dict(size=13, color=palette[metric], line=dict(color='white', width=1)),
        name=metric_lab[metric],
        customdata=sub[['model', 'ci95']].values,
        hovertemplate=('<b>%{customdata[0]}</b><br>'
                       f'{metric_lab[metric]}: %{{y:.4f}}<br>'
                       '95% CI: ±%{customdata[1]:.4f}<extra></extra>'),
        showlegend=metric not in shown,
    ))
    shown.add(metric)

fig.update_layout(
    title='5-fold CV metrics — CPS-MORG 2019',
    xaxis=dict(tickmode='array', tickvals=list(range(len(rank_order))),
               ticktext=rank_order, title='Model (ranked by F1)'),
    yaxis_title='Cross-validated score',
    template='plotly_white', height=480, width=800,
    legend=dict(title='Metric', orientation='h',
                yanchor='bottom', y=-0.22, xanchor='center', x=0.5)
)
fig.show()

❓ Question 3: How does the polynomial SVM compare on precision vs. recall? What does the overlap of confidence intervals across models tell you about model selection here?

Answer Q3

The RBF SVM, logistic regression, and linear SVM are statistically indistinguishable:

Their confidence intervals overlap substantially.
The F1 differences are well below one standard deviation across folds.

The polynomial SVM shows lower recall and greater variability. The decision tree has the widest confidence intervals, reflecting sensitivity to fold composition.

Conclusion:

Simpler, well-regularised linear models generalise extremely well on this dataset.
Additional flexibility provides at most marginal gains.
The performance ranking is stable across folds.

🔧 SVM hyperparameter tuning

We tune two SVM kernels separately, each with its own parameter space.

RBF SVM

Two continuous parameters:

C: controls the penalty for misclassification (margin width).
gamma: controls the kernel bandwidth, determining how far the influence of a single training point extends.

We search a 5 × 5 log-spaced grid (25 combinations).

Polynomial SVM

C
degree (integer controlling polynomial complexity)

We search a 5 × 3 grid (15 combinations).

Both use GridSearchCV with 5-fold cross-validation, optimising ROC-AUC.

If you use a Pipeline, remember that hyperparameters are referenced using:

stepname__parameter (e.g. svm__C).

C_vals   = [0.000977, 0.0131, 0.177, 2.38, 32]
gamma_vals = [0.001, 0.01, 0.1, 1.0, 10.0]

# ── RBF SVM ──────────────────────────────────────────────────────────────────
rbf_pipe = Pipeline([('scaler', StandardScaler()),
                     ('svm', SVC(kernel='rbf', random_state=RANDOM_STATE))])
rbf_gs = GridSearchCV(rbf_pipe,
                      {'svm__C': C_vals, 'svm__gamma': gamma_vals},
                      cv=cv, scoring='roc_auc', n_jobs=-1, refit=True)
rbf_gs.fit(X_train_sub, y_train_sub)
print('RBF  — best params:', rbf_gs.best_params_)
print('RBF  — best AUC:   ', round(rbf_gs.best_score_, 4))

# ── Polynomial SVM ───────────────────────────────────────────────────────────
poly_pipe = Pipeline([('scaler', StandardScaler()),
                      ('svm', SVC(kernel='poly', random_state=RANDOM_STATE))])
poly_gs = GridSearchCV(poly_pipe,
                       {'svm__C': C_vals, 'svm__degree': [1, 2, 3]},
                       cv=cv, scoring='roc_auc', n_jobs=-1, refit=True)
poly_gs.fit(X_train_sub, y_train_sub)
print('Poly — best params:', poly_gs.best_params_)
print('Poly — best AUC:   ', round(poly_gs.best_score_, 4))

RBF  — best params: {'svm__C': 2.38, 'svm__gamma': 0.01}
RBF  — best AUC:    0.8015
Poly — best params: {'svm__C': 32, 'svm__degree': 1}
Poly — best AUC:    0.7931


fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# ── Left: RBF — mean CV AUC vs C (one line per gamma) ────────────────────────
rbf_df = pd.DataFrame(rbf_gs.cv_results_)
n_folds = cv.get_n_splits()

for gamma in gamma_vals:
    sub = rbf_df[rbf_df['param_svm__gamma'] == gamma].copy()
    sub['C'] = sub['param_svm__C'].astype(float)
    sub = sub.sort_values('C')
    axes[0].errorbar(
        sub['C'], sub['mean_test_score'],
        yerr=1.96 * sub['std_test_score'] / np.sqrt(n_folds),
        fmt='o-', capsize=3, label=f'gamma={gamma}'
    )

axes[0].set_xscale('log')
axes[0].set_xlabel('Cost (C)')
axes[0].set_ylabel('Mean CV AUC')
axes[0].set_title('RBF SVM — mean CV AUC vs C')
axes[0].legend(title='Gamma', bbox_to_anchor=(1.02, 1), loc='upper left')

# Star the best combination
best_c = float(rbf_gs.best_params_['svm__C'])
best_g = float(rbf_gs.best_params_['svm__gamma'])
best_row = rbf_df[
    (rbf_df['param_svm__C'].astype(float) == best_c) &
    (rbf_df['param_svm__gamma'].astype(float) == best_g)
].iloc[0]
axes[0].scatter([best_c], [best_row['mean_test_score']], s=250,
                color='black', marker='*', zorder=5)

# ── Right: Polynomial — mean CV AUC vs C (one line per degree) ───────────────
poly_df = pd.DataFrame(poly_gs.cv_results_)
for degree in [1, 2, 3]:
    sub = poly_df[poly_df['param_svm__degree'] == degree].copy()
    sub['C'] = sub['param_svm__C'].astype(float)
    sub = sub.sort_values('C')
    axes[1].errorbar(
        sub['C'], sub['mean_test_score'],
        yerr=1.96 * sub['std_test_score'] / np.sqrt(n_folds),
        fmt='o-', capsize=3, label=f'degree={degree}'
    )

axes[1].set_xscale('log')
axes[1].set_xlabel('Cost (C)')
axes[1].set_ylabel('Mean CV AUC')
axes[1].set_title('Polynomial SVM — mean CV AUC vs C')
axes[1].legend(title='Degree', bbox_to_anchor=(1.02, 1), loc='upper left')

# Star the best combination
best_c_p = float(poly_gs.best_params_['svm__C'])
best_d_p = int(poly_gs.best_params_['svm__degree'])
best_row_p = poly_df[
    (poly_df['param_svm__C'].astype(float) == best_c_p) &
    (poly_df['param_svm__degree'].astype(int) == best_d_p)
].iloc[0]
axes[1].scatter([best_c_p], [best_row_p['mean_test_score']], s=250,
                color='black', marker='*', zorder=5)

fig.subplots_adjust(wspace=0.7, right=0.82)
plt.show()

❓ Question 4: Compare the two tuning plots. For the RBF kernel, which region of the C × gamma space performs best, and what does that tell you about the decision boundary? For the polynomial kernel, which degree wins and why?

Answer Q4

RBF kernel

The best AUC occurs at moderate-to-large C and moderate gamma.

Interpretation:

Very small gamma → kernel becomes very wide → RBF behaves almost linearly.
Very large gamma → kernel becomes extremely local → high-variance boundary (overfitting).
Very small C → strong regularisation → underfitting.
Very large C → weak regularisation → risk of overfitting.

The optimal region balances margin flexibility (C) and locality (gamma), matching the scale of real structure in the data.

The smooth, monotonic behaviour across C for moderate gamma values reinforces that the underlying signal is mostly linear with mild curvature.

Polynomial kernel

The best-performing configuration is degree = 1 at large C, which is mathematically equivalent to a linear SVM.

This aligns with earlier results showing that linear decision boundaries capture most of the predictive structure.

Higher-degree kernels (degree 2 or 3) introduce curvature that does not correspond to genuine nonlinear structure in the wage process. On 5k samples, this additional flexibility increases variance without improving bias.

Conclusion: The wage–education–experience relationship in this cross-section is close to linear; additional polynomial complexity is unnecessary.

🌲 XGBoost tuning

Gradient boosting models contain several continuous hyperparameters that strongly influence model complexity.

Here we tune the learning rate, which controls how much each tree contributes to the ensemble.

Because learning rate spans several orders of magnitude, we use RandomizedSearchCV with a log-uniform prior rather than a fixed grid.

This allows us to: - Explore small and large values efficiently - Avoid over-concentrating on one region of the parameter space

Note

📝 The boosting family

The same RandomizedSearchCV pattern applies to:

lightgbm.LGBMClassifier

catboost.CatBoostClassifier

sklearn.ensemble.RandomForestClassifier

imblearn.ensemble.BalancedRandomForestClassifier

All follow the scikit-learn fit / predict API.

Some variants (e.g. BalancedRandomForest) explicitly address class imbalance during training.

xgb_param_dist = {'learning_rate': loguniform(1e-4, 1.0)}
xgb_model = XGBClassifier(n_estimators=100, eval_metric='logloss',
                          random_state=RANDOM_STATE, n_jobs=-1)
xgb_rs = RandomizedSearchCV(xgb_model, xgb_param_dist,
                            n_iter=20, cv=cv, scoring='roc_auc',
                            n_jobs=-1, random_state=RANDOM_STATE, refit=True)
xgb_rs.fit(X_train_sub, y_train_sub)
print('Best learning rate:', round(xgb_rs.best_params_['learning_rate'], 5))
print('Best AUC:          ', round(xgb_rs.best_score_, 4))

Best learning rate: 0.01604
Best AUC:           0.7926

xgb_cv_df = (pd.DataFrame(xgb_rs.cv_results_)
             .sort_values('param_learning_rate').reset_index(drop=True))
fig, ax = plt.subplots(figsize=(9, 4))
ax.scatter(xgb_cv_df['param_learning_rate'], xgb_cv_df['mean_test_score'],
           color='steelblue', zorder=2)
ax.errorbar(xgb_cv_df['param_learning_rate'], xgb_cv_df['mean_test_score'],
            yerr=1.96 * xgb_cv_df['std_test_score'] / np.sqrt(5),
            fmt='none', color='steelblue', alpha=0.4)
ax.axvline(xgb_rs.best_params_['learning_rate'], color='black', linestyle=':')
ax.set_xscale('log')
ax.set_xlabel('Learning rate (log scale)'); ax.set_ylabel('Mean CV AUC')
ax.set_title('XGBoost — randomised learning rate search (CPS-MORG 2019)')
plt.tight_layout(); plt.show()

print(f'Best RBF SVM AUC : {rbf_gs.best_score_:.4f}')
print(f'Best Poly SVM AUC: {poly_gs.best_score_:.4f}')
print(f'Best XGBoost AUC : {xgb_rs.best_score_:.4f}')

Best RBF SVM AUC : 0.8015
Best Poly SVM AUC: 0.7931
Best XGBoost AUC : 0.7926

❓ Question 5: How do all three tuned models compare on AUC? What are the practical trade-offs between them for a large-scale deployment?

Metric reflection

You have evaluated models using ROC AUC.

Is ROC AUC always the most appropriate metric?
Under what circumstances might it be insufficient?
Can you think of alternative evaluation metrics that may be more appropriate in some contexts?

For example, you may wish to explore:

Precision–Recall AUC
Class-specific recall or precision
The $F_\beta$ score (see (Van Rijsbergen 1979) or check this page)

Now consider the application more concretely:

If the goal were to identify high-wage workers for targeted recruitment, would another metric be preferable?
What if the goal were to identify low-wage workers for policy intervention?

Briefly justify your reasoning.

Answer Q5

All tuned models (RBF SVM, polynomial SVM, XGBoost) achieve very similar ROC AUC (~0.79–0.80).

Because the outcome was defined via a median split, classes are approximately balanced. In this setting:

ROC AUC is appropriate.
Accuracy and F1 are also meaningful.
PR AUC would not materially change the ranking.

Model choice is therefore driven by practical considerations:

RBF SVM - Strong performance. - Training complexity scales poorly with n (≈ O(n²)). - Less suitable for large datasets.

XGBoost - Comparable AUC. - Scales efficiently. - Handles nonlinearities automatically. - Strong default for production use.

LightGBM - Typically faster than XGBoost on large tabular datasets. - Particularly efficient with large feature sets.

CatBoost - Handles categorical variables natively. - Would allow us to avoid one-hot encoding occ_group. - Often strong on structured economic data.

Given similar predictive performance, gradient-boosted trees (XGBoost / LightGBM / CatBoost) are typically more scalable choices.

On metric choice

Because classes are balanced, ROC AUC is informative.

However:

If the goal were targeted recruitment of high-wage workers, you might prioritise precision (or Fβ with β<1).
If the goal were policy intervention for low-wage workers, you might prioritise recall (or Fβ with β>1).

$F_β$ (Van Rijsbergen 1979) generalises F1 by weighting recall relative to precision (see the scikit-learn documentation).

Metric choice should reflect the decision context, not just statistical elegance.

🔍 Feature importance: SHAP values

A central challenge in applied machine learning is interpretability: understanding which features drive predictions and by how much.

Logistic regression is interpretable by design — its coefficients are exact, additive feature attributions. No further tools are needed.
XGBoost and SVMs are black-box or quasi-black-box models. SHAP (SHapley Additive exPlanations) adds a layer of interpretability grounded in cooperative game theory, decomposing each prediction into per-feature contributions that are consistent and locally accurate.

We show three panels side by side:

Logistic regression coefficients: direct interpretation, no SHAP needed.
XGBoost SHAP via TreeExplainer: exact and fast for tree-based models.
RBF SVM SHAP via KernelExplainer: model-agnostic: works on any predict function by approximating Shapley values via weighted linear regression over a background sample. We use 100 background rows to keep it tractable; accuracy improves with more rows at the cost of runtime. This slowness is the price of model-agnosticism.

If all three agree on feature ranking and direction, we can be confident the patterns reflect the data. Disagreement points to genuine non-linearity that only the more flexible model captures.

# ── 1. Logistic regression — coefficients directly ───────────────────────────
lr_model = LogisticRegression(max_iter=1000, random_state=RANDOM_STATE)
lr_model.fit(X_train_sub, y_train_sub)
lr_coef  = pd.Series(lr_model.coef_[0], index=ALL_FEATURES)
lr_coef  = lr_coef.reindex(lr_coef.abs().sort_values().index)

# ── 2. XGBoost — TreeExplainer (exact, fast) ──────────────────────────────────
explainer_xgb = shap.TreeExplainer(xgb_rs.best_estimator_)
shap_vals_xgb = explainer_xgb.shap_values(X_train_sub)

# ── 3. RBF SVM — KernelExplainer (model-agnostic, slow) ─────────────────────
# KernelExplainer only needs a predict function — works on any sklearn model.
# Background sample: 50 rows keeps runtime tractable.

def svm_decision(X):
    # Ensure input is DataFrame with correct feature names
    if not isinstance(X, pd.DataFrame):
        X = pd.DataFrame(X, columns=ALL_FEATURES)
    return rbf_gs.best_estimator_.decision_function(X)

# Small background sample
background = shap.sample(X_train_sub, 50, random_state=RANDOM_STATE)

explainer_svm = shap.KernelExplainer(svm_decision, background)

# Explain only 50 observations
X_explain = X_train_sub.iloc[:50]

shap_vals_svm = explainer_svm.shap_values(
    X_explain,
    nsamples=100   # explicitly limit coalition samples
)

# ── Plot three panels ─────────────────────────────────────────────────────────
fig, axes = plt.subplots(1, 3, figsize=(22, 7))

lr_coef.plot(kind='barh', ax=axes[0],
             color=['#d73027' if v < 0 else '#4575b4' for v in lr_coef])
axes[0].axvline(0, color='black', linewidth=0.8)
axes[0].set_xlabel('Coefficient value')
axes[0].set_title('Logistic Regression\ncoefficients (no SHAP needed)', fontsize=11)

plt.sca(axes[1])
shap.summary_plot(shap_vals_xgb, X_train_sub, feature_names=ALL_FEATURES,
                  show=False, plot_size=None)
axes[1].set_title('XGBoost\nSHAP values (TreeExplainer)', fontsize=11)

plt.sca(axes[2])
shap.summary_plot(shap_vals_svm, X_train_sub.iloc[:50], feature_names=ALL_FEATURES,
                  show=False, plot_size=None)
axes[2].set_title('RBF SVM\nSHAP values (KernelExplainer, 50 obs)', fontsize=11)

plt.tight_layout()
plt.show()

How to read a SHAP beeswarm plot

Each horizontal row corresponds to a feature.

Each dot represents one observation.

The x-axis is the SHAP value:
- Positive → pushes the prediction toward the positive class.
- Negative → pushes the prediction away from it.
- Magnitude → strength of influence.
The colour encodes the feature value:
- Red → high feature value
- Blue → low feature value

To interpret:

Look at how wide the row spreads → overall importance.
Look at colour patterns:
- If red dots cluster on the right, high values increase predicted probability.
- If red dots cluster on the left, high values decrease predicted probability.
Look for asymmetry:
- Wide spread on one side suggests nonlinear or threshold effects.
Compare spread across models:
- Linear models tend to produce symmetric spreads.
- Tree or kernel models often show uneven, clustered patterns reflecting interactions.

SHAP explains local contributions, but the beeswarm aggregates them into a global importance view.

❓ Question 6: Do all three representations agree on which features matter most? What does the comparison between logistic regression coefficients and the SHAP plots reveal about the added value of SHAP for the black-box models?

Answer — Q6

All three approaches broadly agree on the dominant drivers:

Occupation dummies (especially management and professional roles)
female
educ_years
Age / experience variables

However, the way importance manifests differs by model.

Logistic regression

Coefficients provide a global, linear effect.
Ranking reflects marginal log-odds contributions.
Effects are symmetric and additive by construction.

XGBoost (Tree SHAP)

Shows wider SHAP dispersion.
Captures nonlinear interactions and threshold effects.
High education values (red) strongly push predictions upward, but low values (blue) have asymmetric negative impact.

RBF SVM (Kernel SHAP)

Similar high-level ranking.
Slightly different magnitudes.
Greater variability reflects the kernel’s nonlinear geometry.

Key insight:

Where SHAP rankings diverge from logistic regression (especially among mid-ranked features), the likely explanation is nonlinear interaction structure that linear regression cannot represent.

Most importantly:

Even in flexible nonlinear models, female remains among the most influential predictors. The consistency of its importance across model classes strengthens the robustness of the adjusted gender gap finding.

Takeaway

Linear models explain via coefficients. Tree models explain via path-dependent splits. Kernel models explain via local similarity structure.

SHAP unifies these under a single additive attribution framework.

🧮 Multinomial logistic regression and the gender wage gap

We now move from a binary outcome to a three-class outcome:

bottom / middle / top

This allows us to ask:

What predicts entry into the top of the wage distribution?
Are effects symmetric across the distribution?

Multinomial logistic regression models the probability of each class simultaneously using a softmax function.

Prepare the multi-class outcome

y_mc = df_enc['wage_tercile']
X_mc_train, X_mc_test, y_mc_train, y_mc_test = train_test_split(
    X, y_mc, test_size=0.25, stratify=y_mc, random_state=RANDOM_STATE
)
print(y_mc_train.value_counts())

wage_tercile
bottom    29205
top       28413
middle    27622
Name: count, dtype: int64

Inspecting class balance (multi-class outcome)

Before fitting a multinomial model, check whether the three wage classes are balanced.

Answer

Because wage_tercile is constructed using terciles, the three classes are designed to be roughly equal in size (again up to ties at cut points). That makes macro-averaged metrics (like macro F1) natural, because each class is treated symmetrically.

Tip

If one class were substantially rarer, would treating all classes symmetrically in evaluation/training still be appropriate?

Fit the multinomial logistic regression

Logistic regression is fast even on the full training set — no subsampling needed here.

multi_logit = LogisticRegression(solver='lbfgs', max_iter=2000,
                                  random_state=RANDOM_STATE)
multi_logit.fit(X_mc_train, y_mc_train)
print('Classes:', multi_logit.classes_)

Classes: ['bottom' 'middle' 'top']

Tip

The multi_class parameter was removed in scikit-learn ≥ 1.5. With lbfgs and a multiclass target, softmax loss is selected automatically.

Docs: LogisticRegression

Extract and display coefficients

coef_ has shape (n_classes, n_features) — one row per wage class. Reshape into a tidy long DataFrame for plotting.

coef_df = (
    pd.DataFrame(multi_logit.coef_,
                 index=multi_logit.classes_,
                 columns=ALL_FEATURES)
    .stack().reset_index()
)
coef_df.columns = ['class', 'feature', 'log_odds']
coef_df['odds_ratio'] = np.exp(coef_df['log_odds'])
coef_df.sort_values(['class', 'log_odds'], ascending=[True, False])

	class	feature	log_odds	odds_ratio
0	bottom	female	0.567733	1.764263
13	bottom	occ_group_service	0.469432	1.599085
2	bottom	potential_exp	0.329139	1.389771
8	bottom	black	0.301952	1.352497
9	bottom	hispanic	0.215816	1.240873
1	bottom	educ_years	0.125842	1.134103
12	bottom	occ_group_sales_office	0.030227	1.030688
7	bottom	white	-0.056761	0.944820
6	bottom	public_sector	-0.177406	0.837440
4	bottom	married	-0.230779	0.793915
3	bottom	age	-0.349274	0.705200
5	bottom	union_mem	-0.397536	0.671974
11	bottom	occ_group_professional	-0.405619	0.666564
10	bottom	occ_group_management	-0.844736	0.429671
19	middle	union_mem	0.173333	1.189262
21	middle	white	0.141766	1.152307
22	middle	black	0.103459	1.109001
20	middle	public_sector	0.100876	1.106140
14	middle	female	0.050908	1.052226
17	middle	age	0.047177	1.048308
18	middle	married	0.034703	1.035312
23	middle	hispanic	-0.023199	0.977068
16	middle	potential_exp	-0.046642	0.954429
24	middle	occ_group_management	-0.049288	0.951906
15	middle	educ_years	-0.051377	0.949921
26	middle	occ_group_sales_office	-0.073761	0.928893
25	middle	occ_group_professional	-0.108074	0.897561
27	middle	occ_group_service	-0.333691	0.716275
38	top	occ_group_management	0.894025	2.444950
39	top	occ_group_professional	0.513693	1.671452
31	top	age	0.302097	1.352692
33	top	union_mem	0.224203	1.251326
32	top	married	0.196076	1.216619
34	top	public_sector	0.076529	1.079534
40	top	occ_group_sales_office	0.043535	1.044496
29	top	educ_years	-0.074466	0.928239
35	top	white	-0.085005	0.918507
41	top	occ_group_service	-0.135740	0.873069
37	top	hispanic	-0.192616	0.824798
30	top	potential_exp	-0.282497	0.753899
36	top	black	-0.405412	0.666702
28	top	female	-0.618641	0.538676

Interpreting the coefficients

In multinomial (softmax) logistic regression:

There is one coefficient vector $\boldsymbol{\beta}_k$ for each class $k$.
$\mathbf{x}_i$ denotes the feature vector for worker $i$.
$\beta_{k,j}$ is the coefficient on feature $j$ for class $k$.

The predicted probability of class $k$ is:

\[ P(y_i = k \mid \mathbf{x}_i) = \frac{\exp(\boldsymbol{\beta}_k^\top \mathbf{x}_i)} {\sum_{m} \exp(\boldsymbol{\beta}_m^\top \mathbf{x}_i)} \]

This is the softmax function.

Interpretation:

A one-unit increase in feature $j$ changes the log-odds of class $k$ relative to the model’s overall normalisation.
$\exp(\beta_{k,j})$ gives the multiplicative change in relative odds for class $k$, holding other features constant.

Unlike binary logistic regression, there is no single explicit reference category in this parameterisation — probabilities are normalised across all classes.

Coefficients from this model (selected features):

Feature	β_bottom	β_middle	β_top	Interpretation
`female`	+0.57	+0.05	−0.62	Being female → odds ratio 0.54 for top tercile — a 46% reduction in relative odds, conditional on all other controls
`educ_years`	+0.13	−0.05	−0.07	Weakly positive for bottom, near-zero elsewhere — explained by softmax normalisation (see below)
`potential_exp`	+0.33	−0.05	−0.28	More experience → higher odds of bottom and lower odds of top, conditional on age and occupation
`occ_group_management`	−0.84	−0.05	+0.89	Management occupation → odds ratio 2.44 for top tercile
`union_mem`	−0.40	+0.17	+0.22	Union membership compresses wages upward — reduces bottom odds, raises both middle and top

On the educ_years sign: The negative coefficient for top may seem surprising — more education reducing top-tercile odds. This is a softmax artefact: the model already controls for age, potential_exp, and occ_group, which are themselves strong positive correlates of both education and wages. Once those are held constant, the marginal education signal shrinks and can even flip sign due to the softmax normalisation. This is not a causal claim — it reflects multicollinearity in the softmax parameterisation.

On potential_exp: The negative sign for top reflects that highly experienced workers (conditional on age, education, and occupation) are not disproportionately concentrated in the top tercile — they cluster in the middle. Early-career workers with high education and management roles drive the top, not tenure per se.

We now plot log-odds and odds ratios for the five focus features.

FOCUS = ['female', 'educ_years', 'potential_exp', 'union_mem', 'public_sector']
plot_df = coef_df[coef_df['feature'].isin(FOCUS)].copy()
order   = ['bottom', 'middle', 'top']
colors  = ['#2196F3', '#9E9E9E', '#F44336']

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for ax, metric, zero_line, label in [
    (axes[0], 'log_odds',   0, 'Log-odds coefficient (β)'),
    (axes[1], 'odds_ratio', 1, 'Odds ratio (exp(β))'),
]:
    pivot = plot_df.pivot(index='feature', columns='class', values=metric)[order]
    pivot.plot(kind='bar', ax=ax, color=colors, edgecolor='white', width=0.7)
    ax.axhline(zero_line, color='black', linewidth=0.8, linestyle='--')
    ax.set_title(label, fontsize=12); ax.set_xlabel('')
    ax.set_xticklabels([f.replace('_', '\n') for f in pivot.index], rotation=0, fontsize=9)
    ax.legend(title='Wage tercile', fontsize=9)
fig.suptitle('Multinomial logit — selected coefficients (CPS-MORG 2019)', fontsize=13)
plt.tight_layout()
plt.show()

Reading the plot carefully:

female (top class)
β = −0.62 → odds ratio = 0.54.
Holding education, experience, occupation, race, union status, and marital status constant, women have 46% lower relative odds of belonging to the top wage tercile.
female (bottom class)
β = +0.57 → odds ratio = 1.76.
Conditional on the same controls, women have substantially higher relative odds of being in the bottom tercile.
occ_group_management (top)
Odds ratio = 2.44.
Management occupation more than doubles the relative odds of top-tercile membership.
union_mem
Reduces bottom odds (0.67) and increases middle/top odds (>1).
This is consistent with unions compressing wages upward.
educ_years
Small or slightly negative coefficients once occupation and experience are controlled.
This reflects multicollinearity within the softmax parameterisation — not evidence that education reduces top wages.

Key conceptual point:

In a softmax model, coefficients are relative contributions to the probability normalisation across all classes. They are not directly comparable to binary logistic regression coefficients.

Cross-validated macro F1

wage_tercile has string labels — use average='macro' with f1_score.

mc_cv = cross_validate(
    multi_logit, X, y_mc, cv=cv,
    scoring={'f1_macro': 'f1_macro'}
)
scores = mc_cv['test_f1_macro']
print(f'Macro F1: {scores.mean():.4f} ± {scores.std():.4f}')

Macro F1: 0.5374 ± 0.0028

Note on macro F1:

Macro F1 = 0.537 ± 0.003.

This reflects:

Balanced class sizes (terciles),
Equal weighting of bottom, middle, and top,
Averaging of per-class F1 scores.

A macro F1 around 0.53–0.55 is not low in this setting. For comparison, random guessing across three balanced classes would yield macro F1 ≈ 0.33. Thus, 0.54 represents a substantial improvement over baseline.

Why?

Tercile classification is intrinsically difficult:

Observations near the 33rd and 66th percentiles lie close to decision boundaries.
Adjacent-class errors (bottom vs middle, middle vs top) are structurally unavoidable.
Wage distributions are continuous; terciles impose artificial cutoffs.

The very small standard deviation (±0.003) indicates that performance is stable across folds.

Interpretation:

The model captures systematic structure in the wage distribution, but the problem itself contains inherent boundary noise.

Decision boundaries by gender

We now visualise how the predicted wage-tercile regions differ by gender.

Using the same 2D grid as before:

Fix all features at their median.
Set female = 0 in one panel.
Set female = 1 in the other.

Plot the predicted class regions side by side.

class_colors = {'bottom': '#2196F3', 'middle': '#9E9E9E', 'top': '#F44336'}
fig, axes = plt.subplots(1, 2, figsize=(14, 6), sharey=True)

for ax, female_val, title in [
    (axes[0], 0, 'Men (female = 0)'),
    (axes[1], 1, 'Women (female = 1)'),
]:
    g = pd.DataFrame(np.tile(medians.values, (exp_mesh.size, 1)), columns=ALL_FEATURES)
    g['potential_exp'] = exp_mesh.ravel()
    g['educ_years']    = educ_mesh.ravel()
    g['female']        = female_val
    preds = multi_logit.predict(g)
    for cls, color in class_colors.items():
        mask = preds == cls
        ax.scatter(g.loc[mask, 'potential_exp'], g.loc[mask, 'educ_years'],
                   c=color, s=60, marker='s', alpha=0.3, label=cls)
    ax.set_xlabel('Potential experience (years)', fontsize=11)
    ax.set_title(title, fontsize=12); ax.grid(False)

axes[0].set_ylabel('Education (years)', fontsize=11)
handles = [mpatches.Patch(color=c, label=k.capitalize()) for k, c in class_colors.items()]
fig.legend(handles=handles, title='Predicted wage tercile',
           loc='lower center', ncol=3, bbox_to_anchor=(0.5, -0.05))
fig.suptitle('Multinomial logit decision boundaries\n(all other features at median, CPS-MORG 2019)',
             fontsize=13)
plt.tight_layout()
plt.show()

❓ Question 7: Look at the female log-odds for the top class (−0.62, odds ratio 0.54). What does this imply about the adjusted gender wage gap at the top of the distribution? How is this visible in the side-by-side boundary plot?

Answer Q7 (glass ceiling)

An odds ratio of 0.54 for female in the top class means that, conditional on observed characteristics, women have 46% lower relative odds of top-tercile membership than comparable men.

In the boundary plot:

The red (top) region shifts rightward for women.
Women require more education or experience to enter the top region.

This visual shift represents an adjusted glass-ceiling effect within this model.

❓ Question 8: In scikit-learn’s softmax model, how does the female coefficient for top differ from what you would get in a reference-category model (e.g. statsmodels.MNLogit with bottom as base)? When would each parameterisation be preferable?

Answer Q8 (softmax vs reference category)

In scikit-learn’s softmax model:

Each class has its own coefficient vector.
Effects are identified jointly through L2 regularisation.
Coefficients reflect contributions relative to the global softmax normalisation.

In a reference-category model (e.g. MNLogit with bottom as base):

Coefficients are explicit log-odds contrasts relative to the base class.
Interpretation is direct: log-odds(top vs bottom).

Predicted probabilities are equivalent, but coefficients differ numerically.

Use softmax (scikit-learn) for prediction tasks. Use reference-category models (statsmodels) when inference and hypothesis testing with standard errors are central.

❓ Question 9: The model controls for occ_group. Does this mean occupation is fully ‘accounted for’ in the gender gap estimate? What is the debate about whether to include it?

Answer Q9 (occupation debate)

Controlling for occupation isolates the within-occupation gender gap.

However, occupation may lie on the causal pathway from gender to wages (occupational segregation).

If occupation is:

A confounder → controlling is appropriate.
A mediator → controlling absorbs part of the gender effect.

Thus:

Including occupation estimates the conditional gap.
Excluding occupation estimates the total gap.

Both are defensible depending on the causal question.

References

Blau, Francine D., and Lawrence M. Kahn. 2017. “The Gender Wage Gap: Extent, Trends, and Explanations.” Journal of Economic Literature 55 (3): 789–865. https://doi.org/10.1257/jel.20160995.

Mincer, Jacob. 1974. Schooling, Experience, and Earnings. National Bureau of Economic Research, Inc. https://EconPapers.repec.org/RePEc:nbr:nberbk:minc74-1.

Van Rijsbergen, Cornelius Joost. 1979. “Information Retrieval. 2nd. Newton, Ma.” USA: Butterworth-Heinemann.

Materials to download

⚙️ Setup

🥅 Homework objectives

Loading the CPS-MORG 2019 dataset

Feature setup

Inspecting class balance (binary outcome)

Train / test split

Stratified subsample for SVM steps

5-fold cross-validation setup

Evaluation metrics

🗺️ Decision boundaries

Initialise the five models

Build the 2D visualisation grid

1️⃣ obs_grid

2️⃣ vis_grid

1️⃣ obs_grid — a summary of the observed data

2️⃣ vis_grid — a synthetic prediction grid

Why fix all other features at their median?

Fit models and generate predictions

Visualising decision boundaries

📊 5-fold cross-validation

🔧 SVM hyperparameter tuning

RBF SVM

Polynomial SVM

RBF kernel

Polynomial kernel

🌲 XGBoost tuning

On metric choice

🔍 Feature importance: SHAP values

Takeaway

🧮 Multinomial logistic regression and the gender wage gap

Prepare the multi-class outcome

Inspecting class balance (multi-class outcome)

Fit the multinomial logistic regression

Extract and display coefficients

Interpreting the coefficients

Cross-validated macro F1

Decision boundaries by gender

References

1️⃣ `obs_grid`

2️⃣ `vis_grid`

1️⃣ `obs_grid` — a summary of the observed data

2️⃣ `vis_grid` — a synthetic prediction grid