Topic Modeling

Topic modeling in ads-bib runs in four sub-stages: embeddings → dimensionality reduction → clustering → LLM labeling. The package wraps two upstream topic-model libraries and exposes them as interchangeable backends:

  • bertopic: flat, one-layer topics built on the BERTopic library
  • toponymy: hierarchical topic layers built on the Toponymy library

Both libraries are used as-is; ads-bib owns the data, provider, and export pipeline around them.

Data Flow

graph LR
    A[Title/Abstract text] --> B[Embeddings<br/>full-dim vectors]
    B --> C[Reduction → 5D<br/>for clustering]
    B --> D[Reduction → 2D<br/>for visualization]
    C --> E[Clustering<br/>fast_hdbscan]
    E --> F[LLM labeling]
    F --> G[Topic dataframe]
    D --> G

Two reduced spaces are computed from the same embedding matrix. The 5D space is the one that clustering sees — it keeps enough structure for the HDBSCAN family clusterer (default: fast_hdbscan in the pipeline) to separate dense regions. The 2D space is only ever the map coordinate system. The topic assignments come out of the 5D path and are then rendered on the fixed 2D layout. Never tune clustering against the 2D projection.

Choose a Backend

                     bertopic                                      toponymy
Topology             flat (one layer)                              hierarchical (topic_layer_<n>_*)
Use when             you want one flat topic list for              you need semantic drill-down from
                     curation and visualization                    coarse to fine
Preset default       hf_api, local_cpu, local_gpu                  openrouter
Downstream columns   topic_id, Name                                topic_layer_<n>_id, topic_layer_<n>_label,
                                                                   plus topic_id / Name as working-layer aliases

Toponymy keeps topic_id and Name as compatibility aliases for a selected "working layer" so every downstream tool (curation, visualization, citation export) behaves identically. toponymy_layer_index="auto" picks the coarsest available overview layer; set an explicit integer to pin it.
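
To pin the working layer, override the key in a variant run. This is a sketch: the key name comes from this page, and the layer index 1 is only an example value.

# Pin the Toponymy working layer instead of letting "auto" pick it.
ads-bib run --from-run <run_id> --set topic_model.toponymy_layer_index=1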

Provider Matrix

Each of the four roads uses the same model stack for both BERTopic and Toponymy. The local-road labeling defaults are intentionally asymmetric: local_cpu defaults to GGUF via llama_server, while local_gpu runs local Transformers.

Road         Embeddings                   BERTopic labeling     Toponymy labeling
openrouter   OpenRouter                   OpenRouter            OpenRouter
hf_api       HF Inference API             HF Inference API      HF Inference API
local_cpu    local SentenceTransformers   llama_server (GGUF)   llama_server (GGUF)
local_gpu    local SentenceTransformers   local transformers    local transformers

On local roads, both backends can still switch between llama_server and local via topic_model.llm_provider. Remote roads are uniform — the same provider handles embeddings and labeling.
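
For example, to force GGUF labeling through llama_server for a single local-road run (a sketch using the topic_model.llm_provider key named above):

# Override the labeling provider for this run only.
ads-bib run --from-run <run_id> --set topic_model.llm_provider=llama_server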

Embeddings

Embeddings are the expensive semantic step. They are cached under data/cache/embeddings/ keyed by model name and a SHA-256 hash of the input texts. Cache hits are instant; cache misses re-embed the full corpus.
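
If you need to force a full re-embed, for example after a provider updates a model behind the same name, deleting the cache directory is the blunt instrument. This assumes nothing else in your workflow reads that directory:

# Drop all cached embeddings; the next run re-embeds and repopulates the cache.
rm -rf data/cache/embeddings/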

Default preset models:

Road         Embedding model
openrouter   qwen/qwen3-embedding-8b
hf_api       Qwen/Qwen3-Embedding-8B
local_cpu    google/embeddinggemma-300m
local_gpu    google/embeddinggemma-300m

For local roads, the active Torch build decides whether these run on CPU or CUDA. For early exploration on a large corpus, set topic_model.sample_size to limit documents and set it back to null for the final run.
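
A typical exploration loop then looks like the sketch below; the sample size is an illustrative value, and passing null through --set is assumed to behave like setting it in the config file:

# Explore on a sample first...
ads-bib run --from-run <run_id> --set topic_model.sample_size=2000

# ...then re-run on the full corpus for the final result.
ads-bib run --from-run <run_id> --set topic_model.sample_size=null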

To compare embedding models from the same corpus, start a variant from a completed run:

ads-bib run --from-run run_20260407_120000_ads_bib_openrouter \
  --set topic_model.embedding_model=google/gemini-embedding-001

This keeps search/export, translation, tokenization, and optional AND outputs, then recomputes embeddings and later topic artifacts.

When using the OpenRouter preset with Toponymy, toponymy_embedding_model is set separately to qwen/qwen3-embedding-8b. That keeps Toponymy's short keyphrase/name embeddings fast even when the main document embedding model is swapped for a larger model.
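
To swap the document embedding model while keeping Toponymy's internal keyphrase embeddings on the fast default, set both keys in one variant. The model names below reuse the ones from this page; repeating --set for multiple keys is assumed to work the same way as the single-key examples:

# Larger document embeddings, fast keyphrase/name embeddings for Toponymy.
ads-bib run --from-run <run_id> \
  --set topic_model.embedding_model=google/gemini-embedding-001 \
  --set topic_model.toponymy_embedding_model=qwen/qwen3-embedding-8b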

Common failure patterns (embeddings)

  • Cache miss you did not expect (the corpus re-embeds in full) → check whether embedding_model or the input text changed; the cache key is a SHA-256 of both.
  • Out-of-memory on a local road → set topic_model.sample_size for exploration, lower embedding_batch_size, or switch to a smaller embedding model for that run (see the sketch after this list).
  • Slow embeddings on local_cpu → google/embeddinggemma-300m is the CPU-friendly default. Avoid multi-billion-parameter embedding models on CPU.
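
A sketch of the out-of-memory mitigation from the list above, assuming embedding_batch_size lives under topic_model.* like the other keys, with illustrative values:

# Reduce memory pressure: sample the corpus and shrink the embedding batch.
ads-bib run --from-run <run_id> \
  --set topic_model.sample_size=500 \
  --set topic_model.embedding_batch_size=8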

Reduction

The default reduction method is pacmap. umap is available as an advanced override and is the reason the optional ads-bib[umap] extra still exists.

Official presets use:

reduction_method: pacmap
params_5d:
  n_neighbors: 30
  metric: angular
  random_state: 42
params_2d:
  n_neighbors: 30
  metric: angular
  random_state: 42

n_neighbors has the most visible impact. Higher values (50–80) produce broader, connected clusters; lower values (15–30) produce tighter, separated groups. Scale down for datasets under 200 documents.
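
Because the 5D space drives clustering and the 2D space only drives the map, the two can be tuned independently. A sketch using the dotted-path override pattern from this page, with illustrative values:

# Tighter neighborhoods for clustering, broader layout for the map.
ads-bib run --from-run <run_id> \
  --set topic_model.params_5d.n_neighbors=20 \
  --set topic_model.params_2d.n_neighbors=50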

Common failure patterns (reduction)

  • 2D map collapsed into one blob → n_neighbors is too high; drop toward 15–30.
  • 2D map shattered into isolated specks → n_neighbors is too low; raise toward 50–80.
  • Corpus under 200 documents → set n_neighbors to 15–20; the default of 30 assumes enough points to estimate neighborhoods.

Clustering

Official presets use fast_hdbscan with:

cluster_params:
  min_cluster_size: 15
  min_samples: 3
  cluster_selection_method: eom
  cluster_selection_epsilon: 0.05

hdbscan stays available as an advanced override and is the reason the optional ads-bib[hdbscan] extra still exists. For corpora below 500 documents, keep min_cluster_size at 15. For larger corpora, the auto-scaling formula max(15, n_docs * 0.001) kicks in; it only exceeds the floor of 15 once the corpus passes 15,000 documents. Override via cluster_params.min_cluster_size.

Toponymy currently uses Fast-HDBSCAN internals through the 0.2.x call signature, so the package pins fast-hdbscan>=0.2.2,<0.3. This is separate from toponymy_max_workers, which controls concurrent remote labeling and embedding calls rather than clustering threads. For API-based Toponymy-internal embeddings, toponymy_embedding_batch_size controls how many keyphrases or topic names are sent per embedding request.
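
Both knobs are per-run overrides; the values below are illustrative, not tuned defaults:

# Throttle remote labeling concurrency and shrink per-request embedding batches.
ads-bib run --from-run <run_id> \
  --set topic_model.toponymy_max_workers=2 \
  --set topic_model.toponymy_embedding_batch_size=16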

Common failure patterns (clustering)

  • Too few topics (2–3) with a large outlier set → lower min_cluster_size.
  • Too many micro-topics with <10 documents each → raise min_cluster_size.
  • Noisy borders → raise min_samples above the preset default of 3.

Common clustering variants:

# Switch the topic backend.
ads-bib run --from-run <run_id> --set topic_model.backend=toponymy

# Tune the flat BERTopic clusterer.
ads-bib run --from-run <run_id> --set topic_model.cluster_params.min_cluster_size=30

# Tune the Toponymy clusterer.
ads-bib run --from-run <run_id> --set topic_model.toponymy_cluster_params.min_clusters=8

All three start at topic_fit, so embeddings and reductions are reused.

Labeling

Labeling names each cluster via an LLM. Pick a prompt with llm_prompt_name (physics for gravitational physics, generic for domain-agnostic), or override with llm_prompt. bertopic_label_max_tokens and toponymy_local_label_max_tokens cap label length.

For BERTopic, llm_prompt replaces the full labeling prompt. For Toponymy, it appends extra naming instructions to Toponymy's built-in prompt templates, which is the preferred way to keep hierarchy labels concise without forking the backend prompt logic.
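
For example, appending a concision instruction to Toponymy's built-in templates. The prompt text is illustrative, and the quoting is assumed to survive your shell:

# With backend=toponymy, this text is appended to the built-in naming prompts.
ads-bib run --from-run <run_id> \
  --set topic_model.llm_prompt='Keep every topic label under five words.'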

For BERTopic, representation runs a POS filter → KeyBERT → MMR → LLM before the final label emerges. Outlier reassignment uses outlier_threshold (default 0.5) — documents with assignment probability above that threshold get pulled into their nearest cluster, then topic labels are refreshed.
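
To pull more outliers into clusters, lower the threshold so more documents clear the probability bar (0.3 is an example value, not a recommendation):

# More outlier documents exceed the bar and get reassigned before relabeling.
ads-bib run --from-run <run_id> --set topic_model.outlier_threshold=0.3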

Common failure patterns (labeling)

  • Generic or empty labels (e.g. "Topic 1") → check llm_prompt_name: physics is tuned for gravitational physics, generic for other domains.
  • Labeling times out or intermittently fails → lower toponymy_max_workers, or switch the LLM provider for that run.
  • Outliers swallow most documents → lower outlier_threshold so more documents clear the probability bar and get reassigned into their nearest cluster before labels are regenerated.

Labeling-only variants also start at topic_fit because labels are part of the fitted topic model:

# Use a different labeler model.
ads-bib run --from-run <run_id> --set topic_model.llm_model=google/gemini-3-flash-preview

# Switch from the physics prompt to the generic prompt.
ads-bib run --from-run <run_id> --set topic_model.llm_prompt_name=generic

Good Tuning Order

  1. Keep the query fixed.
  2. Choose the backend: bertopic or toponymy.
  3. Inspect embedding quality: does the 2D scatter look structured at all?
  4. Tune n_neighbors in params_5d if clusters look too merged or too fragmented.
  5. Tune cluster_params.min_cluster_size and min_samples for granularity.
  6. For Toponymy, tune toponymy_cluster_params in this order: min_clusters → base_min_cluster_size → base_n_clusters → next_cluster_size_quantile.
  7. Leave toponymy_layer_index="auto" unless you need a fixed working layer.
  8. Only after that, experiment with labeling prompts or models.

For CLI iteration, prefer a variant from the last complete run:

ads-bib run --from-run <run_id> --set topic_model.cluster_params.min_cluster_size=30

Use --dry-run first when you want to confirm which stages will be reused.
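
For example, before committing to a backend switch (a sketch combining the override shown earlier with --dry-run):

# Confirm which stages will be reused without executing the run.
ads-bib run --from-run <run_id> --set topic_model.backend=toponymy --dry-run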

For raw config keys, see Configuration. For phase-level tuning advice across the full pipeline, see the Pipeline Guide.