# Topic Modeling
Topic modeling in ads-bib runs in four sub-stages:
embeddings → dimensionality reduction → clustering → LLM labeling. The
package wraps two upstream topic-model libraries and exposes them as
interchangeable backends:

- `bertopic`: flat, single-layer topics via BERTopic
- `toponymy`: hierarchical, multi-layer topics via Toponymy

Both libraries are used as-is; ads-bib owns the data, provider, and export
pipeline around them.
## Data Flow
```mermaid
graph LR
    A[Title/Abstract text] --> B[Embeddings<br/>full-dim vectors]
    B --> C[Reduction → 5D<br/>for clustering]
    B --> D[Reduction → 2D<br/>for visualization]
    C --> E[Clustering<br/>fast_hdbscan]
    E --> F[LLM labeling]
    F --> G[Topic dataframe]
    D --> G
```
Two reduced spaces are computed from the same embedding matrix. The 5D space
is the one clustering sees: it keeps enough structure for the HDBSCAN-family
clusterer (default: `fast_hdbscan` in the pipeline) to separate dense regions.
The 2D space is only ever the map coordinate system. Topic assignments come
out of the 5D path and are then rendered on the fixed 2D layout. Never tune
clustering against the 2D projection.
## Choose a Backend
| | `bertopic` | `toponymy` |
|---|---|---|
| Topology | flat (one layer) | hierarchical (`topic_layer_<n>_*`) |
| Use when | you want one flat topic list for curation and visualization | you need semantic drill-down from coarse to fine |
| Preset default | `hf_api`, `local_cpu`, `local_gpu` | `openrouter` |
| Downstream columns | `topic_id`, `Name` | `topic_layer_<n>_id`, `topic_layer_<n>_label`, plus `topic_id` / `Name` as working-layer aliases |
Toponymy keeps `topic_id` and `Name` as compatibility aliases for a selected
"working layer" so every downstream tool (curation, visualization, citation
export) behaves identically. `toponymy_layer_index="auto"` picks the coarsest
available overview layer; set an explicit integer to pin it.
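For example, a minimal override that pins the working layer (assuming
`toponymy_layer_index` sits under `topic_model` like the other backend keys
on this page):

```yaml
topic_model:
  backend: toponymy
  toponymy_layer_index: 1   # "auto" would pick the coarsest overview layer
```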
## Provider Matrix
Each of the four roads uses the same model stack for both BERTopic and
Toponymy. The local-road labeling defaults are intentionally asymmetric:
`local_cpu` defaults to GGUF models served via `llama_server`, while
`local_gpu` runs local Transformers.
| Road | Embeddings | BERTopic labeling | Toponymy labeling |
|---|---|---|---|
| `openrouter` | OpenRouter | OpenRouter | OpenRouter |
| `hf_api` | HF Inference API | HF Inference API | HF Inference API |
| `local_cpu` | local SentenceTransformers | `llama_server` (GGUF) | `llama_server` (GGUF) |
| `local_gpu` | local SentenceTransformers | local `transformers` | local `transformers` |
On local roads, both backends can still switch between `llama_server` and
`local` via `topic_model.llm_provider`. Remote roads are uniform: the same
provider handles both embeddings and labeling.
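A sketch of that switch, using the `--from-run` pattern shown later on this
page; `local` and `llama_server` are the two values named above:

```bash
# On a local road, label with local Transformers instead of llama_server.
ads-bib run --from-run <run_id> --set topic_model.llm_provider=local
```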
## Embeddings
Embeddings are the expensive semantic step. They are cached under
`data/cache/embeddings/`, keyed by model name and a SHA-256 hash of the input
texts. Cache hits are instant; cache misses re-embed the full corpus.
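A minimal sketch of that keying scheme; the function name and exact key
layout are illustrative, not the package's internals:

```python
import hashlib

def embedding_cache_key(model_name: str, texts: list[str]) -> str:
    # Documented scheme: key by model name plus a SHA-256 over the input
    # texts. The concrete layout below is an assumption for illustration.
    digest = hashlib.sha256()
    digest.update(model_name.encode("utf-8"))
    for text in texts:
        digest.update(text.encode("utf-8") + b"\x00")  # separator avoids collisions
    return f"{model_name.replace('/', '--')}-{digest.hexdigest()[:16]}"
```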
Default preset models:
| Road | Embedding model |
|---|---|
| `openrouter` | `qwen/qwen3-embedding-8b` |
| `hf_api` | `Qwen/Qwen3-Embedding-8B` |
| `local_cpu` | `google/embeddinggemma-300m` |
| `local_gpu` | `google/embeddinggemma-300m` |
For local roads, the active Torch build decides whether these run on CPU or
CUDA. For early exploration on a large corpus, set
`topic_model.sample_size` to limit documents, and set it back to `null` for
the final run.
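For example (the sample size is an illustrative value):

```bash
# Explore on a 2,000-document sample; rerun with sample_size=null for the final pass.
ads-bib run --from-run <run_id> --set topic_model.sample_size=2000
```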
To compare embedding models on the same corpus, start a variant from a completed run:

```bash
ads-bib run --from-run run_20260407_120000_ads_bib_openrouter \
    --set topic_model.embedding_model=google/gemini-embedding-001
```
This keeps search/export, translation, tokenization, and optional AND outputs, then recomputes embeddings and later topic artifacts.
When using the OpenRouter preset with Toponymy, `toponymy_embedding_model` is
set separately to `qwen/qwen3-embedding-8b`. That keeps Toponymy's short
keyphrase/name embeddings fast even when the main document embedding model is
swapped for a larger model.
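For instance, swapping the document model while leaving Toponymy's keyphrase
model alone (both model names appear elsewhere on this page):

```yaml
topic_model:
  embedding_model: google/gemini-embedding-001       # larger document model
  toponymy_embedding_model: qwen/qwen3-embedding-8b  # fast keyphrase/name model
```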
**Common failure patterns (embeddings)**

- Cache miss you did not expect (the corpus re-embeds in full) → check whether `embedding_model` or the input text changed; the cache key is a SHA-256 of both.
- Out-of-memory on a local road → set `topic_model.sample_size` for exploration, lower `embedding_batch_size`, or switch to a smaller embedding model for that run.
- Slow embeddings on `local_cpu` → `google/embeddinggemma-300m` is the CPU-friendly default; avoid multi-billion-parameter embedding models on CPU.
## Reduction
The default reduction method is `pacmap`. `umap` is available as an advanced
override and is the reason the optional `ads-bib[umap]` extra still exists.
Official presets use:
```yaml
reduction_method: pacmap
params_5d:
  n_neighbors: 30
  metric: angular
  random_state: 42
params_2d:
  n_neighbors: 30
  metric: angular
  random_state: 42
```
`n_neighbors` has the most visible impact. Higher values (50–80) produce
broader, connected clusters; lower values (15–30) produce tighter, separated
groups. Scale down for datasets under 200 documents.
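A variant run that tunes it, assuming `params_5d` nests under `topic_model`
the same way the clustering keys below do:

```bash
# Tighter, more separated clusters for a small corpus.
ads-bib run --from-run <run_id> --set topic_model.params_5d.n_neighbors=15
```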
**Common failure patterns (reduction)**

- 2D map collapsed into one blob → `n_neighbors` is too high; drop toward 15–30.
- 2D map shattered into isolated specks → `n_neighbors` is too low; raise toward 50–80.
- Corpus under 200 documents → set `n_neighbors` to 15–20; the default of 30 assumes enough points to estimate neighborhoods.
## Clustering
Official presets use `fast_hdbscan` with:

```yaml
cluster_params:
  min_cluster_size: 15
  min_samples: 3
  cluster_selection_method: eom
  cluster_selection_epsilon: 0.05
```
`hdbscan` stays available as an advanced override and is the reason the
optional `ads-bib[hdbscan]` extra still exists. For corpora below 500
documents, keep `min_cluster_size` at 15. For larger corpora, the
auto-scaling formula `max(15, n_docs * 0.001)` kicks in (the scaled term
exceeds 15 only above 15,000 documents). Override via
`cluster_params.min_cluster_size`.
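In code, that default is simply (the function name is illustrative, not the
package's actual helper):

```python
def default_min_cluster_size(n_docs: int) -> int:
    # Documented auto-scaling rule: max(15, n_docs * 0.001).
    # E.g. 10_000 docs -> 15; 50_000 docs -> 50.
    return max(15, int(n_docs * 0.001))
```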
Toponymy currently uses Fast-HDBSCAN internals through the 0.2.x call
signature, so the package pins `fast-hdbscan>=0.2.2,<0.3`. This is separate
from `toponymy_max_workers`, which controls concurrent remote labeling and
embedding calls rather than clustering threads. For API-based Toponymy-internal
embeddings, `toponymy_embedding_batch_size` controls how many keyphrases or
topic names are sent per embedding request.
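A sketch of those two knobs together (the values are illustrative, not preset
defaults):

```yaml
topic_model:
  toponymy_max_workers: 4            # concurrent remote labeling/embedding calls
  toponymy_embedding_batch_size: 64  # keyphrases/names per embedding request
```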
**Common failure patterns (clustering)**

- Too few topics (2–3) with a large outlier set → lower `min_cluster_size`.
- Too many micro-topics with fewer than 10 documents each → raise `min_cluster_size`.
- Noisy cluster borders → raise `min_samples` above the preset's 3.
Common clustering variants:

```bash
# Switch the topic backend.
ads-bib run --from-run <run_id> --set topic_model.backend=toponymy

# Tune the flat BERTopic clusterer.
ads-bib run --from-run <run_id> --set topic_model.cluster_params.min_cluster_size=30

# Tune the Toponymy clusterer.
ads-bib run --from-run <run_id> --set topic_model.toponymy_cluster_params.min_clusters=8
```

All three start at `topic_fit`, so embeddings and reductions are reused.
## Labeling
Labeling names each cluster via an LLM. Pick a prompt with
`llm_prompt_name` (`physics` for gravitational physics, `generic` for
domain-agnostic corpora), or override with `llm_prompt`.
`bertopic_label_max_tokens` and `toponymy_local_label_max_tokens` cap label
length.
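A sketch of a labeling override block (the token-cap values are illustrative,
not defaults):

```yaml
topic_model:
  llm_prompt_name: generic            # or: physics
  bertopic_label_max_tokens: 16       # cap flat-topic label length
  toponymy_local_label_max_tokens: 16 # cap hierarchy label length
```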
For BERTopic, `llm_prompt` replaces the full labeling prompt. For Toponymy,
it appends extra naming instructions to Toponymy's built-in prompt templates,
which is the preferred way to keep hierarchy labels concise without forking
the backend prompt logic.
For BERTopic, representation runs a POS filter → KeyBERT → MMR → LLM chain
before the final label emerges. Outlier reassignment uses `outlier_threshold`
(default 0.5): documents with assignment probability above that threshold
get pulled into their nearest cluster, then topic labels are refreshed.
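For example, a variant that tightens the reassignment bar (assuming
`outlier_threshold` nests under `topic_model` like the other keys on this
page):

```bash
ads-bib run --from-run <run_id> --set topic_model.outlier_threshold=0.7
```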
**Common failure patterns (labeling)**

- Generic or empty labels (e.g. "Topic 1") → check `llm_prompt_name`; `physics` is tuned for gravitational physics, `generic` for other domains.
- Labeling times out or intermittently fails → lower `toponymy_max_workers`, or switch the LLM provider for that run.
- Outliers swallow most documents → lower `outlier_threshold` so more documents clear the probability bar and get reassigned into their nearest cluster before labels are regenerated.
Labeling-only variants also start at `topic_fit` because labels are part of
the fitted topic model:

```bash
# Use a different labeler model.
ads-bib run --from-run <run_id> --set topic_model.llm_model=google/gemini-3-flash-preview

# Switch from the physics prompt to the generic prompt.
ads-bib run --from-run <run_id> --set topic_model.llm_prompt_name=generic
```
## Good Tuning Order
1. Keep the query fixed.
2. Choose the backend: `bertopic` or `toponymy`.
3. Inspect embedding quality: does the 2D scatter look structured at all?
4. Tune `n_neighbors` in `params_5d` if clusters look too merged or too fragmented.
5. Tune `cluster_params.min_cluster_size` and `min_samples` for granularity.
6. For Toponymy, tune `toponymy_cluster_params` in this order: `min_clusters` → `base_min_cluster_size` → `base_n_clusters` → `next_cluster_size_quantile`.
7. Leave `toponymy_layer_index="auto"` unless you need a fixed working layer.
8. Only after that, experiment with labeling prompts or models.
For CLI iteration, prefer a variant from the last complete run. Use
`--dry-run` first when you want to confirm which stages will be reused.
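For example, combining the flags shown above (the config key is just a
placeholder):

```bash
# Preview which stages the variant would reuse, without executing anything.
ads-bib run --from-run <run_id> --set topic_model.llm_prompt_name=generic --dry-run
```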
For raw config keys, see Configuration. For phase-level tuning advice across the full pipeline, see the Pipeline Guide.