About this project
Self-hosted academic literature management and citation graph platform. Ingest papers via PDF upload or DOI, enrich metadata via the CrossRef and Semantic Scholar APIs, generate 384-dimensional vector embeddings for semantic search with pgvector, and visualise citation networks as interactive D3 force graphs. Fully async stack: FastAPI, Celery workers, PostgreSQL + pgvector, Redis, and Alembic migrations, with export to Obsidian, Zotero RDF, and GraphML.
Background
During the final stages of my PhD, I was managing close to 400 papers across Zotero, PDFs scattered across drives, and browser bookmarks I'd stopped trusting. The problem wasn't access — it was findability. I knew a paper about liminal space existed somewhere in that pile, but keyword search across filenames wasn't going to find it. What I actually wanted was semantic search: show me papers that are conceptually related to this one, regardless of whether they share exact terminology.
The starting point was straightforward: ingest a DOI, hit CrossRef and Semantic Scholar, store structured metadata. But once I had metadata, I wanted to understand the citation graph — which papers were foundational, which were peripheral, where the clusters of closely related work sat. That led to NetworkX for graph analytics, PageRank for influence scoring, and D3 force simulation for interactive visualisation. The visual output was immediately useful: you can see at a glance where the dense theoretical clusters are and which papers bridge otherwise separate bodies of work.
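To make the influence-scoring idea concrete, here is a minimal, dependency-free sketch of PageRank by power iteration over a toy citation graph. It is not the project's code: the platform relies on NetworkX for this, and the function name, the edge format (citing, cited), and the four placeholder papers are assumptions made for illustration. The damping factor of 0.85 matches the NetworkX default.

```python
def pagerank(edges, damping=0.85, iterations=100, tol=1e-10):
    """Power-iteration PageRank over (citing, cited) edge pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    if not nodes:
        return {}

    # Every node gets an entry, so papers that cite nothing map to [].
    out_links = {node: [] for node in nodes}
    for citing, cited in edges:
        out_links[citing].append(cited)

    n = len(nodes)
    rank = dict.fromkeys(nodes, 1.0 / n)
    for _ in range(iterations):
        # Papers with no outgoing citations spread their rank over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = dict.fromkeys(nodes, base)

        # Each citing paper splits its rank evenly among the papers it cites.
        for citing, targets in out_links.items():
            for cited in targets:
                new_rank[cited] += damping * rank[citing] / len(targets)

        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:  # stop early once the scores have settled
            break
    return rank


# Toy graph (placeholder ids): B, C and D all cite A; D also cites C.
edges = [("B", "A"), ("C", "A"), ("D", "A"), ("D", "C")]
ranks = pagerank(edges)
for paper, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{paper}: {score:.3f}")
```

Paper A, which every other paper cites, ends up with the highest score. That is the sense in which PageRank separates foundational work from peripheral work.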
The vector search layer came later. Sentence-transformers with all-MiniLM-L6-v2 generates embeddings per abstract, stored in pgvector with an IVFFlat index. Querying by semantic similarity rather than keyword changes how you navigate a literature — instead of searching for a paper you remember, you describe the idea and let the index surface what's relevant. The Celery worker architecture keeps ingestion async so the API stays responsive even when enrichment hits rate limits on external APIs.
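A rough sketch of what the query side of that layer might look like. The table name `papers`, the column names, the index name, the `lists` and `probes` values, and the psycopg-style named placeholders are all assumptions for illustration, not the project's actual schema or driver. The `<=>` cosine-distance operator, the `vector_cosine_ops` operator class, and the `ivfflat.probes` setting are standard pgvector features.

```python
import math

# Hypothetical schema: a `papers` table with an `embedding vector(384)` column.
CREATE_INDEX_SQL = """
CREATE INDEX IF NOT EXISTS papers_embedding_idx
ON papers USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
"""

# More probes means better recall and slower queries (value is illustrative).
PROBES_SQL = "SET ivfflat.probes = 10;"

# `<=>` is pgvector's cosine-distance operator; smaller means more similar.
SEARCH_SQL = """
SELECT id, title, embedding <=> %(query)s::vector AS distance
FROM papers
ORDER BY embedding <=> %(query)s::vector
LIMIT %(limit)s;
"""


def to_vector_literal(embedding):
    """Format floats as the text form pgvector accepts, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def cosine_distance(a, b):
    """Reference version of what `<=>` computes: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


# In practice the query vector would hold 384 values produced by the same
# embedding model used at ingestion time; three values keep the example short.
query_literal = to_vector_literal([0.1, 0.2, 0.3])
params = {"query": query_literal, "limit": 10}
print(query_literal)
```

The `ORDER BY` clause repeats the distance expression rather than sorting on the alias, which is the query shape the pgvector documentation uses for index-backed nearest-neighbour search.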
Export to an Obsidian vault was a practical addition: I wanted to write in the same environment I was using for notes, with citation backlinks wired in automatically. The Zotero RDF and GraphML exports were for collaborators who hadn't made the switch. The platform is still in active use.
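As an illustration of how citation backlinks can fall out of the Obsidian export, here is a small sketch that renders one paper as a Markdown note. The function name, the dictionary keys, the front-matter fields, and the placeholder paper are all hypothetical. The only Obsidian behaviour relied on is that a `[[Note name]]` wikilink in one note shows up as a backlink on the target note.

```python
def render_note(paper, cited_titles):
    """Render one paper as an Obsidian note that links to the papers it cites."""
    lines = [
        "---",
        f'title: "{paper["title"]}"',
        f'doi: "{paper["doi"]}"',
        f'year: {paper["year"]}',
        "---",
        "",
        f'# {paper["title"]}',
        "",
        "## Cites",
    ]
    # One wikilink per cited paper; Obsidian derives the backlinks itself.
    lines += [f"- [[{title}]]" for title in cited_titles]
    return "\n".join(lines) + "\n"


# Placeholder metadata, not a real paper or DOI.
paper = {"title": "Paper A", "doi": "10.0000/placeholder", "year": 2020}
note = render_note(paper, ["Paper B", "Paper C"])
print(note)
```

This sketch assumes titles are already safe to use as note names. A real exporter would also need to strip characters that Obsidian or the filesystem rejects, and to escape quotes in the front matter.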
Highlights
- Async ingestion pipeline: PDF → PyMuPDF → CrossRef → Semantic Scholar → ORCID → embed
- pgvector cosine-distance semantic search, with an IVFFlat index for sub-millisecond approximate nearest-neighbour queries
- NetworkX graph analytics: PageRank, betweenness centrality, label-propagation clustering
- D3 v7 force simulation with node radius = PageRank, colour = community cluster
- Export to Obsidian vault, Zotero RDF, and GraphML in a single API call
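
The mapping from graph metrics to the D3 view (node radius from PageRank, colour from community) could be prepared on the server roughly as follows. This is a sketch under assumptions: the function name, the payload field names, and the radius bounds are invented for illustration, and the scores and communities are taken as already computed. A D3 force simulation can consume this shape when its link force is configured with `.id(d => d.id)`.

```python
import json


def to_d3_payload(edges, pagerank, communities, min_radius=4.0, max_radius=24.0):
    """Build a node-link payload: radius from PageRank, community index for colour."""
    low = min(pagerank.values())
    top = max(pagerank.values())
    span = (top - low) or 1.0  # avoid dividing by zero when all scores are equal

    # Map each node to the index of the community that contains it.
    community_of = {
        node: index
        for index, members in enumerate(communities)
        for node in members
    }

    nodes = [
        {
            "id": node,
            # Linear rescale of the score into [min_radius, max_radius].
            "radius": min_radius + (max_radius - min_radius) * (score - low) / span,
            # -1 marks a node that belongs to no detected community.
            "community": community_of.get(node, -1),
        }
        for node, score in sorted(pagerank.items())
    ]
    links = [{"source": citing, "target": cited} for citing, cited in edges]
    return {"nodes": nodes, "links": links}


# Placeholder inputs: two papers, one citation, one community.
payload = to_d3_payload(
    edges=[("B", "A")],
    pagerank={"A": 0.6, "B": 0.4},
    communities=[{"A", "B"}],
)
print(json.dumps(payload, indent=2))
```

On the client, the `community` index would typically be passed through an ordinal colour scale, and `radius` used directly as the circle radius.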