ML Foundations — Regression & Optimization on Real Data

Course Project

Built linear + logistic regression from scratch (GD + mini-batch SGD), then improved generalization with regularization and calibration on Parkinson’s + Breast Cancer datasets.

Python · NumPy · Pandas · Matplotlib · Gradient Descent · SGD · Ridge/L2

Implemented linear + logistic regression training loops (batch + mini-batch SGD)

Studied regularization + calibration to improve generalization and reliability

Problem

Real-world datasets rarely behave like clean textbook examples: features live on different scales, correlations can be misleading, and naive training can converge slowly or to unstable solutions.

I wanted to understand the full end-to-end pipeline behind common models (linear regression and logistic regression) — not by calling libraries, but by implementing the learning loop myself and seeing how optimization choices change convergence and generalization.

For classification, accuracy alone isn’t enough: a model can be ‘right’ but still output unreliable probabilities. That matters when predictions are used for downstream decisions.

Solution

I implemented linear regression and logistic regression from scratch, including the full training loop, gradient computation, and evaluation metrics.

I trained using both full-batch Gradient Descent and mini-batch SGD to compare speed, stability, and sensitivity to hyperparameters.
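A minimal sketch of the two update loops (variable and function names are illustrative, not the exact code from the project):

```python
import numpy as np

def mse_gradient(X, y, w):
    """Gradient of the mean squared error (1/n) * ||Xw - y||^2 w.r.t. w."""
    n = X.shape[0]
    return (2.0 / n) * X.T @ (X @ w - y)

def train_gd(X, y, lr=0.01, epochs=500):
    """Full-batch gradient descent: one gradient step per epoch."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        w -= lr * mse_gradient(X, y, w)
    return w

def train_sgd(X, y, lr=0.01, epochs=50, batch_size=32, seed=0):
    """Mini-batch SGD: reshuffle each epoch, then step on each mini-batch."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = rng.permutation(X.shape[0])
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            w -= lr * mse_gradient(X[batch], y[batch], w)
    return w
```

The contrast is visible directly in these loops: batch GD uses every row per step (smooth but expensive), while mini-batch SGD trades gradient noise for many more updates per pass over the data.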

To reduce overfitting and improve reliability, I added regularization (Ridge/L2) and studied probability calibration to make predicted confidences more trustworthy.
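The L2 penalty is a small change to the gradient. A sketch, assuming the first column of X is a bias/intercept column that should not be penalized:

```python
def ridge_gradient(X, y, w, lam=0.1):
    """MSE gradient plus L2 penalty on the weights (bias excluded)."""
    n = X.shape[0]
    grad = (2.0 / n) * X.T @ (X @ w - y)
    penalty = 2.0 * lam * w
    penalty[0] = 0.0  # assumes column 0 of X is the intercept
    return grad + penalty
```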

Architecture

Data pipeline: loaded datasets, removed non-informative ID columns, checked missing values/duplicates, and ran basic statistics + correlation exploration to understand feature behavior.
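For illustration, the prep step might look like this (the filename and ID column name are assumptions, not the project's actual values):

```python
import pandas as pd

df = pd.read_csv("parkinsons.csv")                    # hypothetical filename
df = df.drop(columns=["subject#"], errors="ignore")   # drop non-informative ID column (name assumed)
print(df.isna().sum())                                # missing values per column
print(df.duplicated().sum())                          # duplicate rows
print(df.describe())                                  # basic statistics
corr = df.corr(numeric_only=True)                     # correlation matrix for exploration
```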

Linear regression (Parkinson’s): implemented MSE/SSE objective, computed gradients, trained with GD/mini-batch SGD, and tracked train/validation error to detect over/underfitting.
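A sketch of the train/validation tracking, reusing mse_gradient from the earlier sketch and assuming pre-split arrays (X_train, y_train, X_val, y_val):

```python
def mse(X, y, w):
    return float(np.mean((X @ w - y) ** 2))

train_hist, val_hist = [], []
w = np.zeros(X_train.shape[1])
for epoch in range(500):
    w -= 0.01 * mse_gradient(X_train, y_train, w)
    train_hist.append(mse(X_train, y_train, w))
    val_hist.append(mse(X_val, y_val, w))
# a growing gap between the two curves signals overfitting;
# both curves plateauing high signals underfitting
```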

Logistic regression (Breast Cancer): implemented sigmoid + cross-entropy loss, trained with GD/SGD, and evaluated using classification metrics (not just raw loss).
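The core pieces of that model, sketched (for very large |z| a numerically stabler sigmoid could be substituted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(p, y, eps=1e-12):
    """Binary cross-entropy; eps guards against log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def logistic_gradient(X, y, w):
    """Gradient of cross-entropy w.r.t. w: X^T (sigmoid(Xw) - y) / n."""
    n = X.shape[0]
    return X.T @ (sigmoid(X @ w) - y) / n
```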

Regularization + calibration: applied L2 penalty to control weight growth and compared how it affects generalization; examined probability outputs to ensure confidence aligns with correctness.
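One simple way to examine calibration is a reliability table: bucket predictions by confidence and compare against observed outcomes. A sketch (function name hypothetical):

```python
import numpy as np

def reliability_table(p, y, n_bins=10):
    """Compare mean predicted probability to the observed positive rate
    per confidence bucket; large gaps indicate miscalibration."""
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            print(f"bin {b}: confidence={p[mask].mean():.2f} "
                  f"observed={y[mask].mean():.2f} n={mask.sum()}")
```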

What I optimized

Training stability: compared batch size effects (variance vs convergence smoothness) and tuned learning rates to avoid divergence or painfully slow training.

Generalization: used validation-driven iteration with L2 regularization to reduce overfitting, especially when features were correlated or noisy.

Interpretability: relied on clear plots (loss curves, error trends, correlation heatmaps) to justify decisions instead of guessing hyperparameters blindly.
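The loss-curve plots are straightforward Matplotlib; a sketch using the train_hist/val_hist lists from the earlier tracking snippet:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 4))
plt.plot(train_hist, label="train MSE")
plt.plot(val_hist, label="validation MSE")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.tight_layout()
plt.show()
```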

Results

Produced an end-to-end reproducible pipeline that trains linear and logistic regression models from scratch and compares GD vs mini-batch SGD behavior.

Demonstrated clear convergence differences: batch GD produced smoother learning curves, while mini-batch SGD reached good solutions faster but introduced noisier updates.

Showed that regularization improves validation performance and makes the model more robust to noisy/high-variance features; calibration analysis highlighted the gap between ‘accuracy’ and ‘trustworthy confidence.’

What I'd do next

Add stronger baselines (scikit-learn) and report side-by-side performance + runtime to quantify the tradeoff between ‘from scratch’ control and optimized library implementations.

Extend evaluation beyond a single split: K-fold cross-validation + confidence intervals for more statistically stable comparisons.
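The fold generation itself is small; a sketch of what the planned splitter might look like:

```python
import numpy as np

def kfold_indices(n, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation."""
    idx = np.random.default_rng(seed).permutation(n)
    for fold in np.array_split(idx, k):
        yield np.setdiff1d(idx, fold), fold
```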

Package the training loops into a clean mini-library (fit/predict/score + plotting utilities) with unit tests so experiments become fast and reusable.
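A sketch of the interface that mini-library might expose (all names hypothetical, mirroring the familiar fit/predict/score convention):

```python
import numpy as np

class LinearRegressionGD:
    """Linear regression trained by gradient descent with optional L2 penalty."""

    def __init__(self, lr=0.01, epochs=500, lam=0.0):
        self.lr, self.epochs, self.lam = lr, epochs, lam
        self.w = None

    def fit(self, X, y):
        self.w = np.zeros(X.shape[1])
        for _ in range(self.epochs):
            grad = (2.0 / len(y)) * X.T @ (X @ self.w - y) + 2.0 * self.lam * self.w
            self.w -= self.lr * grad
        return self

    def predict(self, X):
        return X @ self.w

    def score(self, X, y):
        # mean squared error; a polished version might report R^2 instead
        return float(np.mean((self.predict(X) - y) ** 2))
```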