
Building FMAP: an interactive astro-ph paper atlas with embeddings, UMAP, and arXiv metadata

FindMyArxivPaper (FMAP) started as a simple pipeline for exploring scientific papers and grew into a coherent map of astrophysics. It fetches astro-ph metadata from arXiv, learns text representations from titles and abstracts, evaluates both classification and retrieval, and renders the resulting paper landscape as an interactive, browser-based map.

Tags: Python, Sentence Transformers, UMAP, LinearSVC, Astrophysics, Interactive Visualization

Open 10k interactive atlas View GitHub repo / full project

Interactive atlas

The atlas below is the main output of the project. Each point is a paper, colored by astro-ph category. The positions are not hand-made: they come from embedding each paper's title and abstract, then projecting those vectors into 2D with UMAP so that local neighborhoods stay visually meaningful.

The version embedded on this site is intentionally capped at a lighter public snapshot (10k points) so that it stays fast and small enough for GitHub Pages. The full project and larger local atlas runs live in the FMAP repository.

View the public interactive snapshot here: /fmap/index.html. For the full project and larger atlas outputs, use the external repository link above.

What the model is doing

FMAP deliberately uses two representations of the same paper dataset, because the project needs two kinds of behavior.

1. Category classification

For supervised category prediction, FMAP uses TF-IDF with unigram and bigram features over the concatenated title and abstract, followed by a LinearSVC. This is fast, interpretable, and surprisingly strong for scientific text, where categories often carry stable terminology.
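A minimal sketch of this classifier using scikit-learn, on toy data. Only the unigram and bigram range and the choice of LinearSVC come from the description above. The helper name `build_category_classifier`, the example texts, and the use of default hyperparameters are assumptions for illustration, not FMAP's actual configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_category_classifier():
    """TF-IDF over unigrams and bigrams, followed by a linear SVM."""
    return make_pipeline(
        # ngram_range=(1, 2) yields the unigram + bigram features described above.
        # Everything else is left at scikit-learn defaults (an assumption).
        TfidfVectorizer(ngram_range=(1, 2)),
        LinearSVC(),
    )


# Toy stand-ins for the concatenated title + abstract text and astro-ph labels.
train_texts = [
    "Dark energy constraints from baryon acoustic oscillations",
    "Cosmic microwave background lensing and dark energy",
    "Transit photometry of a hot Jupiter exoplanet atmosphere",
    "Radial velocity detection of an exoplanet around an M dwarf",
]
train_labels = ["astro-ph.CO", "astro-ph.CO", "astro-ph.EP", "astro-ph.EP"]

classifier = build_category_classifier()
classifier.fit(train_texts, train_labels)

# Predict the category of an unseen title.
prediction = classifier.predict(["Atmosphere of a transiting exoplanet"])[0]
print(prediction)
```

Wrapping both steps in one pipeline means the vectorizer is fitted only on the training texts, which keeps test-set vocabulary from leaking into the features.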

2. Retrieval and map structure

Separately, FMAP embeds every paper using sentence-transformers/all-MiniLM-L6-v2. Those dense vectors are used for semantic search, nearest-neighbor recommendations, retrieval metrics, and the atlas geometry itself.
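A rough sketch of how dense embeddings can drive nearest-neighbor lookups and a simple retrieval score. The model name is the one given above. The helper names, the cosine-similarity search, and the "share of neighbors in the same category" metric are assumptions made for this example; the text does not say which retrieval metrics FMAP reports.

```python
import numpy as np


def load_encoder(model_name="sentence-transformers/all-MiniLM-L6-v2"):
    """Load the sentence-transformer (downloads weights on first use)."""
    # Imported lazily so the search helpers below work without the model installed.
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(model_name)


def embed_texts(encoder, texts):
    """Return L2-normalized embeddings, one row per text."""
    vectors = np.asarray(encoder.encode(list(texts)), dtype=np.float32)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def top_k_neighbors(embeddings, query_index, k=5):
    """Indices of the k most cosine-similar papers, excluding the query itself."""
    # Rows are unit length, so a dot product is the cosine similarity.
    scores = embeddings @ embeddings[query_index]
    scores[query_index] = -np.inf
    k = min(k, len(scores) - 1)
    return np.argsort(-scores)[:k]


def category_precision_at_k(embeddings, categories, k=5):
    """Mean fraction of each paper's k nearest neighbors that share its category."""
    per_paper = []
    for i, category in enumerate(categories):
        neighbors = top_k_neighbors(embeddings, i, k)
        per_paper.append(np.mean([categories[j] == category for j in neighbors]))
    return float(np.mean(per_paper))
```

In use, `embed_texts(load_encoder(), combined_texts)` would produce the matrix that both the search helpers and the 2D projection consume. A brute-force dot product like this is fine for tens of thousands of papers; a larger corpus would likely call for an approximate nearest-neighbor index.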

Why UMAP instead of PCA?

Earlier versions used PCA for the 2D map. PCA is easy and quick, but it is a linear projection onto the two directions of highest variance, so it tends to smear semantic clusters into one another and flatten the interesting local geometry. For a paper atlas, that matters: if nearby points do not feel meaningfully related, the map becomes decorative rather than useful.

UMAP gives a much better compromise for this kind of visualization. It preserves local neighborhood structure more faithfully, makes clusters feel more natural, and gives the atlas a denser, more legible shape. In FMAP, the current map uses a moderate neighborhood size with a low minimum distance to keep the local structure compact without turning it into noise.
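The projection step might look roughly like the sketch below, which uses the `umap-learn` package. The values `n_neighbors=30` and `min_dist=0.05` are placeholders standing in for "moderate" and "low"; they are not FMAP's actual settings, and the cosine metric, the fixed seed, and the function name `project_to_2d` are likewise assumptions.

```python
import numpy as np


def project_to_2d(embeddings, n_neighbors=30, min_dist=0.05,
                  random_state=42, reducer=None):
    """Project (n_papers, dim) embeddings to (n_papers, 2) map coordinates.

    `reducer` may be any object with a fit_transform method; when omitted,
    a UMAP reducer is built from the illustrative parameters above.
    """
    if reducer is None:
        import umap  # provided by the umap-learn package

        reducer = umap.UMAP(
            n_components=2,
            n_neighbors=n_neighbors,    # larger values favor global structure
            min_dist=min_dist,          # smaller values pack neighbors tighter
            metric="cosine",
            random_state=random_state,  # fixed seed for a reproducible layout
        )
    return np.asarray(reducer.fit_transform(np.asarray(embeddings)))
```

Accepting an optional `reducer` makes a side-by-side comparison easy: passing `sklearn.decomposition.PCA(n_components=2)` reproduces the older PCA layout from the same embeddings.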

Main results and figures

FMAP writes out a small set of useful diagnostics after each run: category balance, year distribution, a 2D embedding view, and a confusion matrix for the supervised classifier. Together they tell you not just whether the model performs reasonably, but also what kind of corpus the atlas actually represents.

**Confusion matrix.** The classifier is not perfect, but the strong diagonal shows that category-specific language is learnable. The remaining confusions are mostly between nearby astrophysical subfields rather than random failure.

**Category distribution.** This figure makes it obvious whether the atlas is balanced or dominated by a few astro-ph labels.

**Published year distribution.** This was especially useful while debugging arXiv ingestion. It revealed when a nominally large query was still only pulling a recent slice instead of a true historical range.
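The counts behind these diagnostics are cheap to compute. The sketch below assumes each paper is a dict with hypothetical `category` and `published` fields, where `published` is an ISO-style timestamp such as the ones in arXiv's Atom feed. The function names are illustrative and not taken from the FMAP code.

```python
from collections import Counter


def category_counts(papers):
    """Number of papers per astro-ph category."""
    return Counter(paper["category"] for paper in papers)


def year_counts(papers):
    """Number of papers per publication year (first four characters of the timestamp)."""
    return Counter(int(paper["published"][:4]) for paper in papers)


def missing_years(counts, first_year, last_year):
    """Years in the requested range for which no paper was fetched."""
    return [year for year in range(first_year, last_year + 1) if counts[year] == 0]


def confusion_counts(true_labels, predicted_labels):
    """Counter keyed by (true category, predicted category) pairs."""
    return Counter(zip(true_labels, predicted_labels))


papers = [
    {"category": "astro-ph.GA", "published": "2019-03-01T12:00:00Z"},
    {"category": "astro-ph.CO", "published": "2019-07-15T09:30:00Z"},
    {"category": "astro-ph.GA", "published": "2023-01-20T18:45:00Z"},
]
print(category_counts(papers))
print(missing_years(year_counts(papers), 2019, 2023))
```

A non-empty result from `missing_years` is the kind of signal that flags a query pulling only a recent slice rather than the full historical range.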

How the pipeline works end-to-end

1. Query arXiv for astro-ph categories, optionally year-by-year over a historical range.
2. Parse metadata and deduplicate records by arXiv URL.
3. Build combined_text = title + abstract.
4. Split the dataset into training and test partitions.
5. Train a TF-IDF + LinearSVC classifier for category prediction.
6. Embed every paper using a sentence-transformer.
7. Evaluate retrieval using nearest neighbors in embedding space.
8. Project embeddings to 2D with UMAP and generate the interactive HTML atlas.
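The data-preparation steps at the start of this list (query, deduplicate, combine text, split) can be sketched with the standard library alone. This is an illustration under stated assumptions rather than FMAP's code: the record keys `url`, `title`, and `abstract`, the function names, the `submittedDate` range filter used for year-by-year queries, the single-space separator, and the 80/20 split are all choices made for the example. No network request is made here; the first helper only builds the query URL.

```python
import random
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"


def build_query_url(category, year, start=0, max_results=100):
    """URL for one page of results for one category in one calendar year."""
    search = f"cat:{category} AND submittedDate:[{year}01010000 TO {year}12312359]"
    params = {"search_query": search, "start": start, "max_results": max_results}
    return f"{ARXIV_API}?{urlencode(params)}"


def deduplicate_by_url(records):
    """Keep the first record seen for each arXiv URL."""
    seen = set()
    unique = []
    for record in records:
        if record["url"] not in seen:
            seen.add(record["url"])
            unique.append(record)
    return unique


def add_combined_text(record):
    """Copy of the record with whitespace-normalized title and abstract joined."""
    title = " ".join(record["title"].split())
    abstract = " ".join(record["abstract"].split())
    return {**record, "combined_text": f"{title} {abstract}"}


def split_records(records, test_fraction=0.2, seed=0):
    """Shuffle with a fixed seed and return (train, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


raw = [
    {"url": "http://arxiv.org/abs/0000.00001v1", "title": "A  toy\n title", "abstract": "First abstract."},
    {"url": "http://arxiv.org/abs/0000.00001v1", "title": "A  toy\n title", "abstract": "First abstract."},
    {"url": "http://arxiv.org/abs/0000.00002v1", "title": "Another title", "abstract": "Second abstract."},
]
papers = [add_combined_text(r) for r in deduplicate_by_url(raw)]
train, test = split_records(papers, test_fraction=0.5)
print(len(papers), len(train), len(test))
```

Deduplication matters because the same paper can be returned by more than one category query, and a duplicate that lands on both sides of the split would inflate the test scores.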

Why I think this project is interesting

FMAP is a good example of the kind of technical work I enjoy: not just training a model in isolation, but building a full loop from data acquisition to representation learning to evaluation to a public-facing interface. The interesting part is the combination: machine learning, scientific text, visualization, and enough systems glue to make the output actually usable.

It is also a nice reminder that project quality is often about structure rather than only model novelty. A simple classifier can be perfectly sensible if the embeddings, atlas, diagnostics, and interaction design make the overall system coherent.