
NLP Classification — LSTM Baselines vs BERT

Course Project

Built text classifiers and compared sequence models (LSTM/RNN-style) against Transformer-based BERT variants, studying regularization and encoder-freezing tradeoffs.

Python · PyTorch · NLP · LSTM · BERT · Transformers · Dropout · Weight Decay

Evaluated regularization choices (dropout + weight decay) and their effect on overfitting

Compared BERT frozen-encoder vs trainable setups for performance vs compute tradeoffs

Problem

Text classification performance depends on representation: older sequence models learn context gradually, while Transformers encode rich contextual meaning using attention.

The challenge isn’t only accuracy — it’s also compute and practicality: LSTMs are lighter but can struggle with long-range dependencies, while BERT is strong but expensive.

The goal was to compare these approaches on a real benchmark (WOS-11967) and explain which approach wins under which constraints, and why.

Solution

I built a full NLP pipeline to preprocess the Web of Science dataset and train two models: an LSTM baseline and a BERT-based classifier.

I treated the LSTM as the interpretability/efficiency baseline and used it to learn what matters (tokenization, sequence length, regularization, optimization).

Then I evaluated BERT as the high-capacity contextual model and compared results, including frozen-encoder vs more flexible setups to study compute vs accuracy tradeoffs.

Architecture

Data pipeline: loaded WOS-11967, cleaned and prepared text inputs, built label mappings, and split into train/validation/test sets.
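A minimal sketch of how that split and label mapping could look, assuming the cleaned data has been exported to a CSV with text and label columns (the file name and column names are illustrative, not the project's actual schema):

```python
# Sketch of the split/label-mapping step; "wos11967_clean.csv" and the
# column names are hypothetical placeholders for the cleaned WOS-11967 data.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("wos11967_clean.csv")
label2id = {lbl: i for i, lbl in enumerate(sorted(df["label"].unique()))}
df["label_id"] = df["label"].map(label2id)

# Stratified 80/10/10 split so train/val/test share the same class balance.
train_df, tmp_df = train_test_split(
    df, test_size=0.2, stratify=df["label_id"], random_state=42
)
val_df, test_df = train_test_split(
    tmp_df, test_size=0.5, stratify=tmp_df["label_id"], random_state=42
)
```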

LSTM model: embedded tokens → LSTM encoder → classifier head; trained with dropout/regularization to limit overfitting and improve generalization.
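A compact PyTorch sketch of this kind of baseline; the embedding size, hidden size, and dropout rate are assumptions, not the exact project configuration:

```python
# Minimal LSTM text classifier: token embeddings -> LSTM encoder -> dropout
# -> linear classification head. Sizes below are illustrative defaults.
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, vocab_size, num_classes, embed_dim=128,
                 hidden_dim=256, dropout=0.3, pad_idx=0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(dropout)            # regularization on the pooled state
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)          # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)          # hidden: (1, batch, hidden_dim)
        return self.head(self.dropout(hidden[-1]))    # logits: (batch, num_classes)
```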

BERT model: Transformer encoder produces contextual representations → classification head on top; compared training strategies (e.g., freezing encoder vs training more layers) depending on compute constraints.
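A sketch of the frozen-encoder variant, assuming the Hugging Face transformers library and a bert-base-uncased checkpoint (both are assumptions; the project only specifies a BERT-based classifier):

```python
# Frozen-encoder setup: only the classification head stays trainable,
# trading some accuracy for a much cheaper fine-tuning run.
from transformers import AutoModelForSequenceClassification

num_classes = len(label2id)   # label2id from the data-pipeline sketch above
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=num_classes
)

for param in model.bert.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```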

Evaluation: reported results in a single comparison table, highlighting which model wins and why (representation power vs efficiency).
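The table itself can be assembled from held-out predictions along these lines; y_test, lstm_preds, and bert_preds are assumed to come from the two trained models, and no metric values are shown here:

```python
# Build the head-to-head comparison table from held-out predictions.
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score

def summarize(name, y_true, y_pred):
    return {
        "model": name,
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
    }

rows = [
    summarize("LSTM baseline", y_test, lstm_preds),          # lstm_preds: assumed LSTM outputs
    summarize("BERT (frozen encoder)", y_test, bert_preds),  # bert_preds: assumed BERT outputs
]
print(pd.DataFrame(rows))
```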

What I optimized

Regularization strategy: tuned dropout and weight decay to reduce overfitting on the LSTM baseline, whose capacity is large enough to memorize patterns at this dataset size.
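In PyTorch terms the two knobs live in different places: dropout inside the model, weight decay in the optimizer. A hedged sketch that reuses the hypothetical LSTMClassifier and label mapping from the Architecture sketches and assumes a train_loader DataLoader; the specific rates are illustrative:

```python
# Dropout is set on the model, weight decay on the optimizer (AdamW).
import torch

model = LSTMClassifier(vocab_size=30_000, num_classes=len(label2id), dropout=0.3)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = torch.nn.CrossEntropyLoss()

model.train()
for batch_ids, batch_labels in train_loader:   # train_loader: assumed DataLoader of token ids/labels
    optimizer.zero_grad()
    loss = criterion(model(batch_ids), batch_labels)
    loss.backward()
    optimizer.step()
```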

Efficiency vs performance: explored BERT training strategies to balance accuracy gains against compute cost.
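Building on the frozen-encoder sketch above, one middle ground is to unfreeze only the top encoder blocks; the choice of two blocks here is illustrative, not the project's setting:

```python
# Partial unfreezing: keep most of BERT frozen, train only the last
# two Transformer blocks plus the classification head.
for param in model.bert.parameters():
    param.requires_grad = False
for block in model.bert.encoder.layer[-2:]:
    for param in block.parameters():
        param.requires_grad = True
```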

Reproducibility: structured the experiments so the comparison is apples-to-apples (same splits, same metric reporting, consistent preprocessing assumptions).
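A typical seed-setting guard for that kind of reproducibility; the seed value is arbitrary:

```python
# Fix the random seeds shared by Python, NumPy, and PyTorch so that
# repeated runs use the same splits and initializations.
import random
import numpy as np
import torch

def set_seed(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)
```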

Results

Produced a clean head-to-head benchmark table comparing LSTM vs BERT on WOS-11967 and clearly identified the winner.

Observed the expected representation advantage: BERT’s contextual embeddings typically deliver stronger generalization because they encode meaning with attention rather than relying only on sequential memory.

Validated the practical takeaway: LSTM remains useful as a fast baseline, but Transformers dominate when accuracy is the priority and compute is available.

What I'd do next

Run targeted error analysis: identify which categories LSTM fails on (often those needing broader context) and where BERT’s gains are largest.
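Per-class metrics are a natural starting point for that analysis; this sketch assumes the same hypothetical y_test, predictions, and label2id mapping as the earlier snippets:

```python
# Per-class precision/recall/F1 for each model, to spot the categories
# where the LSTM falls behind and where BERT gains the most.
from sklearn.metrics import classification_report

print("LSTM:\n", classification_report(y_test, lstm_preds, target_names=list(label2id)))
print("BERT:\n", classification_report(y_test, bert_preds, target_names=list(label2id)))
```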

Try lighter Transformer variants (DistilBERT / smaller encoders) to find the best accuracy-per-compute point.

Add calibration + confidence analysis so the model can signal uncertainty on borderline abstracts instead of forcing a high-confidence guess.
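One way to start that analysis is expected calibration error (ECE); this is a generic sketch rather than anything from the project, and the bin count is arbitrary:

```python
# Expected calibration error: bin predictions by confidence and compare
# each bin's average confidence against its empirical accuracy.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: max softmax probability per example; correct: 1/0 per example."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece
```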