Problem
Academic research workflows are fragmented across too many tools — one for literature search, another for citation management, another for writing, another for peer review. Switching between them is slow, and none of them understand each other's context. Researchers spend more time managing tools than doing research.
At the document level, academic PDFs are structurally messy: references appear in inconsistent formats across publishers, layouts break text extraction, and metadata is often incomplete or missing entirely. A purely rule-based approach fails on edge cases, but a purely LLM-based approach is expensive and can hallucinate structure when extraction is noisy.
The core challenge was building a platform accurate enough to trust across the full research lifecycle — from discovering literature to producing a final manuscript — while keeping AI use surgical rather than treating every step as a generation problem.
Solution
POP AI is an integrated research assistant that covers the full academic workflow in seven features, all built on a shared Next.js frontend and FastAPI backend, with AWS S3 as the storage layer and OpenAI GPT models handling generation and analysis tasks.
Feature 1 automates the bi-weekly synchronization of the Semantic Scholar dataset to AWS S3, eliminating manual data management and keeping the platform's literature database current without intervention.
Feature 2 replaces a costly third-party Google Scholar API with a compliant in-house scraping solution built on ScraperAPI and BeautifulSoup, giving full control over literature retrieval while preserving the same output format for all downstream components.
Feature 3 is the PDF Citation Analyzer: it extracts all references from an uploaded paper, retrieves their metadata from S3, matches each citation to the most relevant passage in the body text, and scores relevance using GPT. This semester it was substantially rebuilt with parallel processing, automatic citation style detection across 9 formats, and deterministic regex-based block matching — reducing average processing time from 3.2s to 1.1s per reference while maintaining ~100% accuracy across 533 references on 18 papers.
Feature 4 is the Grant Writing Assistant: it parses grant guideline PDFs, analyzes the user's project file against the funder's requirements, surfaces missing information and clarifying questions, and produces a structured outline and full narrative draft in a six-step wizard.
Feature 5 is the Research Paper Reviewer: it extracts individual claims from a submitted paper, retrieves semantically relevant evidence using text embeddings, and returns structured feedback identifying logical weaknesses, unsupported claims, and overstated conclusions — each with a risk score and suggested rewrite.
Feature 6 is the Research Paper Writer: a six-step wizard that guides users through paper type selection, requirements, content input, outline generation, section-by-section drafting, and final manuscript export as DOCX or PDF.
Across Features 3–6, a collapsible history sidebar lets users save, search, and reload past sessions — making the platform feel like a persistent workspace rather than a one-shot tool.
Architecture
Backend: FastAPI (Python) as the service boundary for all ingestion, analysis, and generation endpoints. Routes are structured around each feature's pipeline so uploads, processing, enrichment, and results retrieval are cleanly separated and independently testable.
Storage: AWS S3 as the source of truth for PDF storage with hashed/deduplicated identifiers. PostgreSQL stores structured outputs — documents, references, extracted fields, enrichment results, and session state. Redis handles caching and intermediate state, with RedisTimeSeries tracking pipeline performance metrics under load.
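The hashed/deduplicated identifier scheme can be sketched as follows. This is a minimal illustration, not the production code: the function name `pdf_object_key`, the `papers/` prefix, and the choice of SHA-256 are assumptions for the example.

```python
import hashlib

def pdf_object_key(pdf_bytes: bytes, prefix: str = "papers") -> str:
    """Derive a deterministic S3 key from file contents, so the same PDF
    uploaded twice maps to a single stored object (content-addressed dedup)."""
    digest = hashlib.sha256(pdf_bytes).hexdigest()
    return f"{prefix}/{digest}.pdf"
```

Because the key depends only on the bytes, repeated uploads of the same paper resolve to one S3 object, and PostgreSQL rows can reference it by the same stable identifier.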
PDF processing: pdfminer.six for text extraction across multi-column layouts, PyMuPDF for the Paper Reviewer where cleaner extraction matters, and PyPDF2 as a fallback for complex grant guideline PDFs. All extraction is followed by normalization and deduplication before any LLM call.
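The post-extraction normalization pass might look something like the sketch below. The exact rules are assumptions for illustration; the production pipeline's cleanup is more extensive.

```python
import re
import unicodedata

def normalize_extracted_text(raw: str) -> str:
    """Clean raw extractor output before any LLM call: normalize unicode,
    rejoin words hyphenated across line breaks, collapse stray whitespace."""
    text = unicodedata.normalize("NFKC", raw)
    text = text.replace("\u00ad", "")             # drop soft hyphens
    text = re.sub(r"-\n(?=\w)", "", text)         # rejoin hyphenated words
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)  # trim line edges
    text = re.sub(r"[ \t]{2,}", " ", text)        # collapse space runs
    return text.strip()
```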
Citation pipeline (F3): Multi-stage extraction using deterministic regex-first splitting and title extraction, with GPT as a fallback only. Automatic citation style detection votes across the first 20 references to identify the format among 9 styles. Block matching uses numeric regex for bracketed/Vancouver styles and author-year regex for APA/Harvard, eliminating GPT from the matching step entirely. All references are processed concurrently via a ThreadPoolExecutor.
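The voting step can be sketched as below. This is a toy version covering only two of the nine styles, with hypothetical patterns; the real detector's regexes are richer.

```python
import re
from collections import Counter

# Hypothetical, simplified patterns: one per style family.
STYLE_PATTERNS = {
    "vancouver": re.compile(r"^\[?\d{1,3}[\].]\s"),                     # "[12] ..." or "12. ..."
    "apa":       re.compile(r"^[A-Z][\w'-]+,\s+[A-Z]\.\s?.*\(\d{4}\)"), # "Smith, J. (2020)"
}

def detect_citation_style(references: list[str], sample: int = 20) -> str:
    """Vote across the first `sample` references; the majority style wins."""
    votes = Counter()
    for ref in references[:sample]:
        for style, pattern in STYLE_PATTERNS.items():
            if pattern.match(ref.strip()):
                votes[style] += 1
                break
    return votes.most_common(1)[0][0] if votes else "unknown"
```

Voting over a sample rather than trusting a single reference makes the detector robust to one or two malformed entries at the top of a bibliography.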
Paper Reviewer (F5): Claim extraction using GPT-4o-mini, semantic retrieval using text-embedding-3-large to find the top-k most relevant evidence spans per claim, and GPT verdict generation returning risk scores between 0 and 1. Configurable threshold (default 0.55) for flagging.
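The top-k retrieval step reduces to cosine similarity over precomputed embeddings. A minimal sketch, assuming the claim and span vectors have already been produced by the embedding model:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k_evidence(claim_vec, span_vecs, spans, k=3):
    """Rank paper spans by similarity to the claim embedding, keep the top k."""
    scored = sorted(zip(spans, span_vecs),
                    key=lambda sv: cosine(claim_vec, sv[1]),
                    reverse=True)
    return [span for span, _ in scored[:k]]
```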
Grant Writer and Paper Writer (F4, F6): Multi-stage GPT-4.1 pipelines with chunk-and-merge for long PDFs, structured JSON outputs validated before being passed to the frontend, and section-by-section drafting that feeds prior context forward to maintain coherence.
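The chunk-and-merge split might look like the sketch below; the character budget and overlap size are illustrative assumptions. Overlap between chunks keeps sentences that straddle a boundary visible to both model calls before the merge pass.

```python
def chunk_text(text: str, max_chars: int = 8000, overlap: int = 200) -> list[str]:
    """Split a long document into overlapping chunks sized for one model call;
    per-chunk outputs are merged in a final pass."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back so boundary sentences appear twice
    return chunks
```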
Deployment: docker-compose environment coordinating the API and all supporting services for consistent behavior across machines. Async execution throughout so long-running parsing and enrichment tasks never block user requests.
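One way the "never block user requests" property is achieved for CPU-bound parsing is to push blocking work off the event loop; a minimal sketch (the function names and return shape are hypothetical):

```python
import asyncio

def parse_pdf_blocking(path: str) -> dict:
    """Stand-in for a heavy, synchronous parsing step."""
    return {"path": path, "pages": 12}

async def handle_upload(path: str) -> dict:
    """Run the blocking parse in a worker thread so the event loop
    keeps serving other requests while it runs."""
    return await asyncio.to_thread(parse_pdf_blocking, path)
```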
My Contributions
I contributed across all seven features over two semesters, with primary ownership of the Google Scholar proxy (F2), the PDF Citation Analyzer rebuild (F3), and the Research Paper Reviewer (F5), and supporting contributions across the Grant Writer (F4) and Paper Writer (F6) including frontend work and integration.
For F2, I built the SerpAPI replacement — a compliant scraping solution using ScraperAPI, BeautifulSoup for HTML parsing, asyncio for concurrent execution, and normalization/deduplication to match the original output format so no downstream components needed to change.
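The concurrency and timeout pattern can be sketched as follows. The fetch is stubbed here; the real version would route each request through ScraperAPI (rotating IPs) and parse the HTML with BeautifulSoup. All names and the timeout value are illustrative assumptions.

```python
import asyncio

async def _do_fetch(query: str, page: int) -> str:
    await asyncio.sleep(0)  # placeholder for the real network round trip
    return f"<html>results for {query} p{page}</html>"

async def fetch_page(query: str, page: int, timeout: float = 10.0):
    """One page fetch with a hard timeout; failures return None
    instead of stalling the whole batch."""
    try:
        return await asyncio.wait_for(_do_fetch(query, page), timeout)
    except asyncio.TimeoutError:
        return None

async def fetch_all(query: str, pages: int = 3) -> list:
    """Issue all page requests concurrently; gather preserves input order."""
    return await asyncio.gather(*(fetch_page(query, p) for p in range(1, pages + 1)))
```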
For F3, I led the major rebuild: parallel processing with ThreadPoolExecutor, automatic citation style detection across 9 formats, regex-first splitting and title extraction with GPT as fallback only, and deterministic block matching that eliminated GPT from the matching step entirely — cutting processing time from ~3.2s to ~1.1s per reference across 533 references on 18 papers.
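The parallelization pattern is straightforward to sketch. The per-reference worker below is a stand-in for the real stages (regex-first parsing, metadata lookup, block matching, GPT fallback); its parsing logic is purely illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def process_reference(ref: str) -> dict:
    """Stand-in for the per-reference pipeline stages."""
    return {"raw": ref, "title": ref.split(". ", 1)[-1].rstrip(".")}

def process_all(references: list[str], workers: int = 8) -> list[dict]:
    """Process every reference concurrently; pool.map preserves input order,
    so results line up with the original bibliography."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_reference, references))
```

Threads suit this workload because each reference spends most of its time waiting on I/O (S3 metadata lookups, occasional GPT fallback calls) rather than on the CPU.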
For F5, I contributed to backend design, prompt engineering, and frontend integration for the Research Paper Reviewer — a claim-centric pipeline using text embeddings for evidence retrieval and GPT for verdict generation, returning structured feedback with risk scores and suggested rewrites.
For F4 and F6, I contributed to frontend development and integration work — helping ship the Grant Writer and Paper Writer wizard interfaces and ensuring consistent behavior across the shared platform.
What I optimized
Hybrid reliability: deterministic logic for what's stable — structure, obvious fields, normalized parsing — and LLM use only where it adds real value. This improved extraction quality without turning the pipeline into an expensive 'LLM everywhere' system.
Citation pipeline performance: the combination of parallel execution, lighter model (GPT-4.1-nano), regex-first processing, and deterministic block matching cut average processing time from ~3.2s to ~1.1s per reference. The changes were fully backward-compatible — same endpoint, same JSON output format.
Normalization and deduplication: aggressive unicode normalization, consistent formatting, and hashing strategies so references and metadata stay consistent across documents and repeated uploads. This directly improves retrieval quality and prevents analytics from being polluted by duplicate variants of the same paper.
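A canonical-fingerprint hash in this spirit might look like the sketch below; the exact normalization rules and the truncated digest length are assumptions for the example.

```python
import hashlib
import re
import unicodedata

def reference_fingerprint(title: str, year: str = "") -> str:
    """Canonical hash for a reference: unicode-normalized, accents stripped,
    lowercased, punctuation removed, so formatting variants of the same
    paper collide on one key."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = re.sub(r"[^a-z0-9 ]", "", text.lower())
    text = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha256(f"{text}|{year}".encode()).hexdigest()[:16]
```

Two differently formatted citations of the same paper then hash to the same fingerprint, which is what keeps repeated uploads from polluting analytics with duplicate variants.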
Scraping reliability: ScraperAPI with rotating IPs and timeout handling, asyncio for concurrent requests, and output normalization to match SerpAPI's format exactly — so no downstream components needed to change when the replacement was deployed.
Results
Seven features shipped end-to-end across two semesters covering the complete academic research workflow — from automated literature sync to final manuscript export.
PDF Citation Analyzer: rebuilt pipeline reduced processing time from ~3.2s to ~1.1s per reference while maintaining ~100% extraction accuracy, validated across 533 references on 18 papers spanning 9 citation styles.
Google Scholar proxy: fully replaced SerpAPI with a compliant scraping solution that passed all integration tests and produced cleaner, better-deduplicated results than the original implementation.
Research Paper Reviewer: consistently identified missing citations, overly broad conclusions, and methodological gaps across test papers, with structured JSON output stable across all test cases.
Grant Writer and Paper Writer: end-to-end pipelines successfully transformed raw guidelines and project documents into structured grant materials and complete downloadable manuscripts — validated with real grant documents and multiple paper topics.
History sidebar added across all four interactive features, backed by session models and CRUD endpoints, making the platform feel like a persistent workspace.
What I'd do next
Combine the Paper Reviewer and Paper Writer into a unified writing agent — so the system can review a draft, identify weaknesses, and iterate on specific sections in a single workflow rather than two separate tools.
Add a canonical paper identity layer (DOI/arXiv/semantic fingerprinting) to the citation pipeline so references map to a single truth even when formatting differs wildly across papers.
Expand the grant writer to support multi-document project portfolios — letting users attach multiple supporting files that get analyzed together rather than as a single project document.
Build observability into the citation pipeline: a dashboard tracking per-stage failure rates, extraction confidence by citation style and publisher, and which reference types have the highest skip rate — to guide targeted improvements.
Move from monthly snapshot projections to per-document intelligence in the literature pipeline: semantic search over extracted sections, 'related works' discovery, and citation-quality scoring that explains why a citation is relevant rather than just assigning a score.