Problem
Text classification performance hinges on representation: recurrent sequence models like LSTMs build up context token by token, while Transformers encode rich contextual meaning directly through self-attention.
The challenge isn't only accuracy; it's also compute and practicality: LSTMs are lightweight but can struggle with long-range dependencies, while BERT is strong but expensive to train and serve.
The goal was to compare these approaches on a real benchmark (WOS-11967) and explain which one wins under which constraints, and why.
Solution
I built a full NLP pipeline to preprocess the Web of Science dataset and train two models: an LSTM baseline and a BERT-based classifier.
I treated the LSTM as the interpretability/efficiency baseline and used it to establish which choices matter most (tokenization, sequence length, regularization, optimization).
Then I evaluated BERT as the high-capacity contextual model and compared results, including a frozen-encoder setup versus partially or fully fine-tuned setups, to study compute-versus-accuracy tradeoffs.
Architecture
Data pipeline: loaded WOS-11967, cleaned and prepared text inputs, built label mappings, and split into train/validation/test sets (see the data sketch after this list).
LSTM model: embedded tokens → LSTM encoder → classifier head; trained with dropout/regularization to limit overfitting and improve generalization (see the LSTM sketch below).
BERT model: Transformer encoder produces contextual representations → classification head on top; compared training strategies (e.g., freezing the encoder vs training more layers) depending on compute constraints (see the BERT sketch below).
Evaluation: reported results in a single comparison table, highlighting which model wins and why (representation power vs efficiency); see the evaluation sketch below.
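A minimal sketch of the data pipeline, assuming the dataset is available as a CSV with abstract and label columns; the file name, column names, cleaning steps, and 80/10/10 split ratios are illustrative assumptions, not the exact preprocessing used.

```python
# Data-preparation sketch. The path, column names, and split ratios below are
# assumptions for illustration, not the actual WOS-11967 layout.
import pandas as pd
from sklearn.model_selection import train_test_split

def load_wos(path="wos_11967.csv", seed=42):
    df = pd.read_csv(path)
    df["abstract"] = df["abstract"].str.strip()            # basic cleaning
    df = df.dropna(subset=["abstract", "label"])

    # Label <-> id mappings shared by both classifier heads.
    labels = sorted(df["label"].unique())
    label2id = {lab: i for i, lab in enumerate(labels)}
    df["label_id"] = df["label"].map(label2id)

    # Stratified 80/10/10 train/validation/test split.
    train_df, rest_df = train_test_split(
        df, test_size=0.2, stratify=df["label_id"], random_state=seed)
    val_df, test_df = train_test_split(
        rest_df, test_size=0.5, stratify=rest_df["label_id"], random_state=seed)
    return train_df, val_df, test_df, label2id
```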
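The LSTM baseline, sketched in PyTorch. The vocabulary size, embedding and hidden dimensions, dropout rate, and class count are placeholder values rather than the tuned settings.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Embedding -> LSTM encoder -> dropout -> linear classifier head."""
    def __init__(self, vocab_size, num_classes,
                 embed_dim=128, hidden_dim=256, dropout=0.3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)       # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)       # final hidden state per sequence
        pooled = self.dropout(hidden[-1])          # (batch, hidden_dim)
        return self.classifier(pooled)             # (batch, num_classes) logits

# Dummy forward pass: a batch of 4 sequences of 200 token ids.
model = LSTMClassifier(vocab_size=20_000, num_classes=35)  # class count is a placeholder
logits = model(torch.randint(1, 20_000, (4, 200)))
```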
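The BERT classifier, sketched with Hugging Face transformers. The checkpoint (bert-base-uncased), maximum sequence length, and class count are assumptions; the frozen-encoder variant shown here is the cheapest of the training strategies compared.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_CLASSES = 35  # placeholder; in practice taken from the label mapping

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=NUM_CLASSES)

# Frozen-encoder setup: only the classification head is trained (cheap, lower ceiling).
for param in model.bert.parameters():
    param.requires_grad = False

batch = tokenizer(
    ["An abstract about convolutional networks for medical imaging."],
    padding=True, truncation=True, max_length=256, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits                 # (batch, NUM_CLASSES)
```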
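Evaluation sketch: both models are scored on the same test split with the same metrics so the comparison table is apples-to-apples. The choice of accuracy and macro-F1, and the dummy predictions, are illustrative.

```python
from sklearn.metrics import accuracy_score, f1_score

def table_row(name, y_true, y_pred):
    """One row of the LSTM-vs-BERT comparison table."""
    return {"model": name,
            "accuracy": accuracy_score(y_true, y_pred),
            "macro_f1": f1_score(y_true, y_pred, average="macro")}

# Dummy predictions stand in for the real test-set outputs of each model.
y_true     = [0, 1, 2, 2, 1, 0]
lstm_preds = [0, 1, 1, 2, 1, 0]
bert_preds = [0, 1, 2, 2, 1, 0]

for row in (table_row("LSTM", y_true, lstm_preds),
            table_row("BERT", y_true, bert_preds)):
    print(f"{row['model']:5s} acc={row['accuracy']:.3f} macro_f1={row['macro_f1']:.3f}")
```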
What I optimized
Regularization strategy: tuned dropout and weight decay to reduce overfitting on the LSTM baseline, whose capacity can memorize patterns when the training subset is small (see the regularization sketch after this list).
Efficiency vs performance: explored BERT training strategies, from a frozen encoder to partial unfreezing, to balance accuracy gains against compute cost (see the unfreezing sketch below).
Reproducibility: structured the experiments so the comparison is apples-to-apples (same splits, same metric reporting, consistent preprocessing assumptions); see the seeding sketch below.
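A sketch of the regularization tuning: dropout inside the LSTM plus weight decay in the optimizer, searched over a small grid on the validation split. The grid values and learning rate are placeholders, and LSTMClassifier refers to the baseline sketched in the Architecture section.

```python
import itertools
import torch

# Small grid searched on the validation split (values are placeholders).
dropout_grid      = [0.1, 0.3, 0.5]
weight_decay_grid = [0.0, 1e-4, 1e-2]

for dropout, weight_decay in itertools.product(dropout_grid, weight_decay_grid):
    model = LSTMClassifier(vocab_size=20_000, num_classes=35, dropout=dropout)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=weight_decay)
    # ...train on the train split, evaluate on the validation split, keep the best config...
```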
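One way to trade compute for accuracy with BERT, between a fully frozen encoder and full fine-tuning, is to unfreeze only the top few encoder layers; the layer count and class count here are assumptions.

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=35)            # class count is a placeholder

# Freeze the whole encoder, then unfreeze the top N layers plus the classifier head.
N = 2
for param in model.bert.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[-N:]:
    for param in layer.parameters():
        param.requires_grad = True
for param in model.classifier.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```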
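A sketch of the reproducibility setup: fixed seeds for every RNG plus a fixed random_state in the split, so both models train and evaluate on identical data. The seed value is arbitrary.

```python
import random
import numpy as np
import torch

def set_seed(seed=42):
    """Seed every RNG the pipeline touches so runs are repeatable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)
# The same random_state is passed to train_test_split (see the data sketch above),
# so the LSTM and BERT comparisons use identical train/validation/test splits.
```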
Results
Produced a clean head-to-head benchmark table comparing LSTM vs BERT on WOS-11967 and clearly identified the winner.
Observed the expected representation advantage: BERT’s contextual embeddings typically deliver stronger generalization because they encode meaning with attention rather than relying only on sequential memory.
Validated the practical takeaway: LSTM remains useful as a fast baseline, but Transformers dominate when accuracy is the priority and compute is available.
What I'd do next
Run targeted error analysis: identify which categories the LSTM fails on (often those needing broader context) and where BERT's gains are largest (a per-class sketch follows this list).
Try lighter Transformer variants (DistilBERT / smaller encoders) to find the best accuracy-per-compute point.
Add calibration + confidence analysis so the model can signal uncertainty on borderline abstracts instead of forcing a high-confidence guess (a calibration sketch follows this list).
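A possible starting point for the per-category error analysis: compare per-class F1 between the two models and rank categories by BERT's gain. The labels below are dummies standing in for the shared test split.

```python
from sklearn.metrics import f1_score

# Dummy labels; in practice these are the shared test-set labels and predictions.
y_true     = [0, 0, 1, 1, 2, 2, 2]
lstm_preds = [0, 1, 1, 2, 2, 0, 2]
bert_preds = [0, 0, 1, 1, 2, 2, 2]

classes = sorted(set(y_true))
lstm_f1 = f1_score(y_true, lstm_preds, labels=classes, average=None, zero_division=0)
bert_f1 = f1_score(y_true, bert_preds, labels=classes, average=None, zero_division=0)

# Rank categories by how much BERT improves over the LSTM baseline.
ranked = sorted(zip(classes, lstm_f1, bert_f1), key=lambda row: row[1] - row[2])
for cls, l_f1, b_f1 in ranked:
    print(f"class {cls}: LSTM F1 {l_f1:.2f} -> BERT F1 {b_f1:.2f} (gain {b_f1 - l_f1:+.2f})")
```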
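A minimal sketch of the calibration and confidence analysis: temperature scaling fitted on validation logits, then a max-probability threshold to flag borderline abstracts. The optimizer choice, number of steps, and threshold value are assumptions.

```python
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, steps=200, lr=0.01):
    """Standard temperature scaling: fit one scalar on held-out validation logits."""
    log_t = torch.zeros(1, requires_grad=True)
    optimizer = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        optimizer.step()
    return log_t.exp().item()

# Dummy validation logits/labels stand in for real model outputs (35-class placeholder).
val_logits = torch.randn(64, 35)
val_labels = torch.randint(0, 35, (64,))
temperature = fit_temperature(val_logits, val_labels)

# At inference, scale logits and flag low-confidence ("borderline") abstracts.
probs = F.softmax(val_logits / temperature, dim=-1)
confidence, prediction = probs.max(dim=-1)
needs_review = confidence < 0.6                    # threshold is an assumption
```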