Problem
Academic PDFs are messy in practice: references appear in inconsistent formats, layouts vary by publisher, and “copy the bibliography” quickly turns into manual cleanup. That breaks downstream tasks like citation search, reference quality checks, and building reliable summaries that actually point to the right sources.
A purely rule-based approach fails on edge cases (odd line breaks, multi-column layouts, stray headings, incomplete metadata), while a purely LLM-based approach is expensive, slow, and prone to hallucinating structure when the PDF text extraction is noisy.
The core challenge of the capstone was turning unstructured PDFs into a structured representation (references, citations, metadata, sections) that is accurate enough to trust, fast enough to scale across many documents, and robust to formatting variance.
Solution
We built a hybrid research assistant platform that converts PDFs into structured data by combining deterministic extraction (where it’s reliable) with LLM-based cleanup/extraction (where PDFs get messy). The key idea is: don’t ask an LLM to invent structure from chaos—first extract and normalize as much as possible deterministically, then use the model surgically to resolve ambiguity and edge cases.
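To make that deterministic-first routing concrete, here is a minimal sketch of the idea (function and field names are illustrative, not the actual capstone code): parse cheaply with patterns, gate on a confidence heuristic, and escalate only the low-confidence leftovers to an LLM cleanup step.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ParsedReference:
    raw: str
    authors: Optional[str] = None
    title: Optional[str] = None
    year: Optional[int] = None

YEAR_RE = re.compile(r"\((19|20)\d{2}\)")

def parse_deterministic(raw: str) -> ParsedReference:
    """Cheap pattern-based parse that handles well-formed references."""
    ref = ParsedReference(raw=raw)
    match = YEAR_RE.search(raw)
    if match:
        ref.year = int(match.group(0).strip("()"))
        authors_part, _, rest = raw.partition(match.group(0))
        ref.authors = authors_part.strip(" ,.") or None
        ref.title = rest.split(".")[0].strip(" .") or None
    return ref

def is_confident(ref: ParsedReference) -> bool:
    """Heuristic gate: escalate only when key fields are missing."""
    return bool(ref.authors and ref.title and ref.year)

def parse_reference(raw: str,
                    llm_cleanup: Callable[[str], ParsedReference]) -> ParsedReference:
    ref = parse_deterministic(raw)
    if is_confident(ref):
        return ref              # deterministic path: fast, cheap, predictable
    return llm_cleanup(raw)     # surgical LLM fallback for the messy minority
```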
The pipeline treats ingestion as a real system, not a script. Users upload a document once, it gets stored durably, then background processing extracts text/layout, detects sections, identifies and normalizes references, enriches metadata, and produces structured outputs for downstream features like summarization and citation quality scoring.
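The service boundary for that flow looks roughly like the sketch below, assuming FastAPI's built-in BackgroundTasks (the real pipeline may hand work to a dedicated worker instead; route and function names are assumptions):

```python
import uuid
from fastapi import BackgroundTasks, FastAPI, UploadFile

app = FastAPI()

def run_pipeline(doc_id: str, pdf_bytes: bytes) -> None:
    # Placeholder for the real stages: text/layout extraction, section detection,
    # reference normalization, metadata enrichment, and persistence.
    ...

@app.post("/documents")
async def upload_document(file: UploadFile, background: BackgroundTasks):
    pdf_bytes = await file.read()
    doc_id = uuid.uuid4().hex
    background.add_task(run_pipeline, doc_id, pdf_bytes)  # heavy work runs after the response
    return {"document_id": doc_id, "status": "processing"}
```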
For citations/references, we implemented normalization + deduplication so the same paper isn’t stored five different ways because of punctuation differences, unicode variants, or inconsistent author/title formatting. This made extraction more accurate and significantly improved retrieval and the reliability of the reference quality checks.
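A minimal sketch of that normalization + dedup step, assuming a simple title/author/year key (field names are illustrative): collapse unicode variants, punctuation, and casing before hashing, so cosmetic differences stop producing "new" references.

```python
import hashlib
import re
import unicodedata

def normalize_text(s: str) -> str:
    s = unicodedata.normalize("NFKD", s)               # fold unicode variants
    s = s.encode("ascii", "ignore").decode("ascii")    # drop diacritics
    s = re.sub(r"[^a-z0-9 ]", " ", s.lower())          # strip punctuation
    return re.sub(r"\s+", " ", s).strip()              # collapse whitespace

def reference_key(title: str, first_author: str, year: str) -> str:
    """Dedup key: references with the same key are stored as one paper."""
    canonical = f"{normalize_text(title)}|{normalize_text(first_author)}|{year}"
    return hashlib.sha1(canonical.encode()).hexdigest()
```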
To enrich documents beyond what the PDF provides, we integrated Google Scholar-style scraping (as a SerpAPI replacement path) so the platform can pull missing metadata and improve reference resolution (titles, authors, venue/year). This is especially important when PDFs are incomplete or references are abbreviated.
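The enrichment step behaves roughly like a resolver chain with fallbacks. The sketch below is a simplified view under that assumption; the resolver functions are placeholders rather than real APIs, and only fields the PDF was missing get filled in.

```python
from typing import Callable, Dict, List, Optional

Resolver = Callable[[Dict[str, str]], Optional[Dict[str, str]]]

def enrich_reference(ref: Dict[str, str],
                     resolvers: List[Resolver],
                     cache: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    key = ref.get("dedup_key") or ref.get("title", "")
    if key in cache:
        return cache[key]                 # already enriched: reuse the result
    for resolve in resolvers:             # e.g. a scholar scraper, then other fallbacks
        found = resolve(ref)
        if not found:
            continue
        for field, value in found.items():
            if not ref.get(field):        # only fill fields the PDF didn't provide
                ref[field] = value
        if ref.get("title") and ref.get("year"):
            break                         # good enough: stop hitting resolvers
    cache[key] = ref
    return ref
```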
Architecture
Backend API: FastAPI (Python) as the service boundary for ingestion and analysis endpoints. We structured routes around the PDF Analyzer pipeline so uploads, parsing, enrichment, and results retrieval are cleanly separated and testable.
Parsing layer: A multi-stage extraction flow using deterministic parsing for layout/structure and reference detection (regex/pattern extraction + segmentation) with a model-assisted step for messy cases. This hybrid approach improves robustness while controlling cost and failure modes.
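As an example of the deterministic part, the segmentation step can be as simple as splitting a raw references block on common numbering patterns before any per-entry parsing; the patterns below are illustrative, not the full set the pipeline uses.

```python
import re
from typing import List

# Matches "[12] ...", "12. ...", or "12) ..." at the start of a line.
ENTRY_START = re.compile(r"^\s*(?:\[\d{1,3}\]|\d{1,3}[.)])\s+", re.MULTILINE)

def segment_references(block: str) -> List[str]:
    starts = [m.start() for m in ENTRY_START.finditer(block)]
    if not starts:
        return [block.strip()]           # unnumbered styles fall through to the model-assisted path
    starts.append(len(block))
    entries = [block[a:b].strip() for a, b in zip(starts, starts[1:])]
    return [re.sub(r"\s*\n\s*", " ", e) for e in entries]   # rejoin wrapped lines
```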
Storage: AWS S3 as the source of truth for PDF storage (durable, scalable ingestion). We use hashed/deduplicated identifiers so repeated uploads don’t create redundant processing and so filenames can be normalized safely.
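A sketch of the hashed-key idea, assuming boto3 and an illustrative bucket/key scheme: the key is derived from the file bytes, so re-uploads of the same PDF map to the same object instead of triggering redundant processing.

```python
import hashlib
import boto3

s3 = boto3.client("s3")
BUCKET = "research-assistant-pdfs"       # assumed bucket name

def store_pdf(pdf_bytes: bytes) -> str:
    digest = hashlib.sha256(pdf_bytes).hexdigest()
    key = f"pdfs/{digest}.pdf"           # content-addressed, filename-independent
    s3.put_object(Bucket=BUCKET, Key=key, Body=pdf_bytes,
                  ContentType="application/pdf")
    return key
```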
Data + caching: Postgres stores structured outputs (documents, references, extracted fields, enrichment results). Redis is used for caching/intermediate state, and RedisTimeSeries tracks time-based signals/metrics to understand pipeline performance and system behavior under load.
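For the time-based signals, the metric writes look roughly like the sketch below, assuming redis-py's RedisTimeSeries commands and illustrative key names (the RedisTimeSeries module must be loaded on the server).

```python
import time
import redis

r = redis.Redis(host="localhost", port=6379)

def record_stage(stage: str, duration_ms: float) -> None:
    # "*" lets Redis assign the current timestamp to the sample.
    r.ts().add(f"pipeline:{stage}:duration_ms", "*", duration_ms,
               labels={"stage": stage})

def timed_stage(stage: str, fn, *args, **kwargs):
    start = time.perf_counter()
    try:
        return fn(*args, **kwargs)
    finally:
        record_stage(stage, (time.perf_counter() - start) * 1000)
```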
Metadata enrichment: Scholar-style scraping pipeline with fallback logic and deduplication. The enrichment step boosts reference quality by filling in missing fields and standardizing metadata.
Deployment: docker-compose environment coordinating the API + supporting services so the system runs consistently across machines and can scale without ‘works on my laptop’ drift. Async execution ensures long-running parsing/enrichment doesn’t block user requests.
What I optimized
Hybrid reliability: I focused on getting the best of both worlds—deterministic logic for what’s stable (structure, obvious fields, normalized parsing) and LLM use only where it adds real value (formatting edge cases, noisy reference blocks). This improved extraction quality without turning the pipeline into an expensive ‘LLM everywhere’ system.
Normalization + deduplication: I implemented aggressive cleaning/normalization (unicode normalization, consistent formatting, hashing/dedupe strategies) so references and metadata remain consistent across documents and repeated uploads. This directly improves retrieval, reduces redundant processing, and prevents analytics from being polluted by duplicates.
Performance + throughput: We designed ingestion as async-first so a user can upload and continue, while processing runs in the background. This avoids blocking requests and makes the system behave like a platform rather than a single-threaded script.
System consistency: Dockerized coordination of services + clear API boundaries so each part of the pipeline is composable, easier to debug, and reproducible across the team environment.
Results
Built an end-to-end platform that turns PDFs into structured data (references/citations/metadata + extracted content) and supports downstream research workflows like summarization and citation quality checks.
Improved citation/reference extraction accuracy by ~40% through the combination of deterministic parsing and a targeted LLM cleanup/extraction step, plus normalization/deduplication to reduce noisy variants of the same reference.
Implemented Scholar-style metadata enrichment so documents become more complete and references become more resolvable even when the PDF is messy or missing key fields.
Delivered a scalable ingestion workflow: PDFs stored durably in S3, structured results stored in Postgres, caching/metrics via Redis/RedisTimeSeries, and async processing so heavy work doesn’t block the API.
Coordinated system design and integration across a multi-person capstone environment, ensuring consistent deployment and predictable behavior across services.
What I'd do next
Make reference resolution more ‘entity-based’: add a canonical paper identity layer (DOI/arXiv/semantic fingerprinting) so references map to a single truth even when formatting differs wildly. This would make citation graphs and analytics significantly more reliable.
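One possible shape for that identity layer (speculative, since this is future work rather than something built in the capstone): prefer strong identifiers when they exist and fall back to a normalized fingerprint otherwise.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PaperIdentity:
    doi: Optional[str] = None
    arxiv_id: Optional[str] = None
    fingerprint: Optional[str] = None    # e.g. hash of normalized title + first author

    def canonical_id(self) -> str:
        if self.doi:
            return f"doi:{self.doi.lower()}"
        if self.arxiv_id:
            return f"arxiv:{self.arxiv_id}"
        return f"fp:{self.fingerprint}"
```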
Improve observability and quality metrics: expand RedisTimeSeries/telemetry into a full pipeline dashboard that tracks per-stage failure rates, extraction confidence, and ‘where the pipeline breaks’ by PDF type/publisher to guide targeted improvements.
Add human-in-the-loop correction: a lightweight UI to edit/confirm extracted references/metadata, then feed corrections back into the system (rules tuning + model prompts) so the platform improves over time with real usage.
Harden enrichment: broaden coverage beyond Scholar-style scraping by adding additional resolvers (venue pages, open metadata sources, fallback heuristics), with rate limiting and caching so enrichment scales safely.
Turn structured data into user-facing workflows: semantic search over extracted sections/references, ‘related works’ discovery, and citation-quality scoring that explains *why* a citation is relevant (not just a score).