Configuration¶
Complete reference of all configuration keys. For explanations and tuning guidance, see the Pipeline Guide.
CLI Presets¶
The primary runtime path is the CLI. ads-bib ships four official packaged
starter presets:
ads-bib run --preset openrouter --set search.query='author:"Hawking, S*"'
ads-bib preset write openrouter --output ads-bib.yaml
ads-bib doctor --preset openrouter --set search.query='author:"Hawking, S*"'
From Python, ads_bib.run(...) accepts the same packaged presets as the CLI.
Each preset defines one runtime road. The presets are generic starter configs,
so you must set search.query before running. The usual command to start a run
is ads-bib run. preset write is optional and useful when you want one editable
YAML file. doctor is the support command that prints the full preflight report
without starting a run.
For the provider stack, default backend, and hardware requirements of each preset, see Runtime Roads. One install covers all four presets; the preset switch only changes providers and defaults at runtime.
Install¶
Use a Python 3.12 environment. One base install covers every preset.
On NVIDIA / CUDA machines, install the validated CUDA Torch and TorchVision
wheels into the same env so that local_gpu runs on the GPU:
uv pip install ads-bib "torch==2.6.0" "torchvision==0.21.0" --extra-index-url https://download.pytorch.org/whl/cu124
To restore the validated CPU wheels explicitly in a local CPU env:
uv pip install "torch==2.6.0" "torchvision==0.21.0" --extra-index-url https://download.pytorch.org/whl/cpu
The validated local HF stack for this release is Torch 2.6.x with Transformers 4.56.x.
Optional algorithm extras are available when you intentionally switch defaults.
See Install & First Run for the full first-run walk-through.
Start a variant from a previous run
Every completed run saves the resolved configuration and reusable stage
artifacts. Use --from-run when you want to change one key and keep the
rest:
ads-bib run --from-run run_20260407_120000_ads_bib_openrouter \
--set topic_model.embedding_model=google/gemini-embedding-001
ads-bib loads config_used.yaml, applies the override, picks the first
affected stage, and writes a new run folder under runs/. Use --dry-run
to see the changed keys and reused/recomputed stages before creating the
variant.
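As a rough illustration of how "the first affected stage" can be derived from the changed keys, here is a minimal sketch. The stage order and the rule that a key's top-level section maps directly to a stage are assumptions made for this example; the real mapping inside ads-bib may differ.

```python
# Hypothetical stage order, for illustration only (not read from ads-bib).
STAGE_ORDER = ["search", "translate", "tokenize", "topic_model", "visualization", "citations"]


def first_affected_stage(changed_keys):
    """Return the earliest stage whose config section appears in changed_keys."""
    # Assumption: the first dotted component of a key names the stage it affects.
    sections = {key.split(".", 1)[0] for key in changed_keys}
    for stage in STAGE_ORDER:
        if stage in sections:
            return stage
    return None


changed = ["topic_model.embedding_model"]
stage = first_affected_stage(changed)
# Everything before the first affected stage can be reused from the source run.
reused = STAGE_ORDER[: STAGE_ORDER.index(stage)]
print(stage, reused)
```

Under these assumptions, overriding an embedding model recomputes from the topic model stage onward and reuses the earlier stage artifacts.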
Unless stated otherwise, the tables below describe the raw code defaults. The
Preset Override column shows the value used by the four packaged starter
presets when they deviate from the code default. Inspect
src/ads_bib/_presets/*.yaml or write a preset locally with
ads-bib preset write ... when you need the full road-specific starter config.
Notebook Companion¶
The root Colab notebook is a small frontend for the same preset-driven runner
path. It loads the local_gpu preset, sets the example query and run folder,
and runs the resolved config. It does not define a separate configuration
schema; the keys below are the YAML, CLI override, Python, and notebook keys.
Run¶
| Key | Type | Default | Description |
|---|---|---|---|
| run_name | string | "ads_bib_run" | Identifier appended to the timestamped run directory name |
| start_stage | string | "search" | First stage to run. Used from YAML or PipelineConfig when you do not pass a start stage from Python/CLI. CLI --from and ads_bib.run(start_stage=...) override this. |
| stop_stage | string \| null | null | Last stage to run; null runs to the end. CLI --to and ads_bib.run(stop_stage=...) override this. |
| random_seed | int | 42 | Seed for reproducible reductions and clustering |
| openrouter_cost_mode | string | "hybrid" | OpenRouter cost resolution. "hybrid" combines live usage with a pricing lookup (default). "strict" fails fast when cost data is incomplete. "fast" skips the extra lookup and trusts the streaming usage payload. |
| project_root | string \| null | null | Project folder for shared data/cache/ and run outputs under runs/; defaults to current working directory |
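The way run_name ends up in the run directory name can be sketched as follows. The run_<date>_<time>_<run_name> pattern is inferred from the example run id on this page (run_20260407_120000_ads_bib_openrouter) and is an assumption, and run_dir_name is a hypothetical helper, not an ads-bib function.

```python
from datetime import datetime


def run_dir_name(run_name: str, now: datetime) -> str:
    """Build a timestamped run folder name (pattern inferred, not confirmed)."""
    return f"run_{now:%Y%m%d_%H%M%S}_{run_name}"


name = run_dir_name("ads_bib_openrouter", datetime(2026, 4, 7, 12, 0, 0))
print(name)
```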
Search¶
| Key | Type | Default | Description |
|---|---|---|---|
| query | string | — | ADS search query (syntax reference) |
| ads_token | string \| null | null | ADS API token; falls back to ADS_TOKEN env var |
| refresh_search | bool | true | Re-run the ADS query (set false to reuse cached bibcodes) |
| refresh_export | bool | true | Re-resolve bibcodes to metadata (set false to reuse cached export) |
Example query compositions:
# Simple author query
query: 'author:"Hawking, S*"'
# Author + topic filter
query: '(author:"Hawking, S*") AND abs:"black hole"'
# Seed + forward citations
query: 'author:"Hawking, S*" OR citations(author:"Hawking, S*")'
Source Input¶
Use source_input when the publication/reference corpus already exists outside
ADS, for example after preparing Semantic Scholar or INSPIRE exports in the
repository. When both paths are set, the high-level runner loads those Parquet
files as the initial corpus and starts from the first requested downstream
stage; leave the paths unset for normal ADS search/export runs.
| Key | Type | Default | Description |
|---|---|---|---|
| publications_path | string \| null | null | Parquet file with source publications |
| references_path | string \| null | null | Parquet file with cited/reference records |
| source_name | string \| null | null | Optional source label recorded with the run config |
source_input:
  publications_path: data/source/publications.parquet
  references_path: data/source/references.parquet
  source_name: semantic_scholar
run:
  start_stage: translate
Translate¶
| Key | Type | Default | Description |
|---|---|---|---|
| enabled | bool | true | Skip translation when false |
| provider | string | varies | openrouter, nllb, llama_server, huggingface_api, or transformers |
| model | string \| null | varies | Model identifier for openrouter / huggingface_api / transformers, or an HF repo id / local path for nllb |
| model_repo | string \| null | null | HF repo for GGUF model download (llama_server provider) |
| model_file | string \| null | null | Filename within the repo (llama_server provider) |
| model_path | string \| null | null | Explicit local path to a GGUF file (llama_server provider) |
| api_key | string \| null | null | Provider API key; falls back to env var |
| max_workers | int | 10 | Concurrent translation requests for remote providers; initial batch size for local transformers translation |
| max_tokens | int | 2048 | Maximum tokens per translation request |
| fasttext_model | string \| null | null | Path to the fasttext language detection model; packaged presets set data/models/lid.176.bin |
Llama Server¶
Shared configuration for pipeline stages that use llama_server as provider.
This is the default local labeling path for local_cpu and an optional local
labeling path for local_gpu.
| Key | Type | Default | Description |
|---|---|---|---|
| command | string | "llama-server" | Default package-managed command token. With the default value, ads-bib tries PATH, then the managed cache, then an on-demand managed runtime download. Set an explicit path or custom command only to override that behavior. |
| host | string | "127.0.0.1" | Bind address |
| port | int \| null | null | Port; null auto-selects a free port |
| threads | int \| null | null | CPU threads; null uses system default |
| ctx_size | int | 4096 | Context window size in tokens |
| gpu_layers | int | -1 | GPU layers to offload; -1 = GPU road default, 0 = CPU-managed local road. With the default command: "llama-server", a PATH-resolved runtime may still be probed with -1 first and fall back to 0 automatically. |
| startup_timeout_s | float | 120.0 | Seconds to wait for the server to become ready |
| reasoning | string | "off" | Reasoning mode; "off" for standard inference |
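For port: null, one common way to auto-select a free port is to bind to port 0 and let the operating system choose. The sketch below shows that general technique; whether ads-bib does exactly this is an assumption, and pick_free_port is a hypothetical helper.

```python
import socket


def pick_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port on the given host."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Port 0 tells the OS to assign any free port.
        sock.bind((host, 0))
        return sock.getsockname()[1]


port = pick_free_port()
print(port)
```

The port is released when the socket closes, so another process could in principle claim it before the server starts. Setting an explicit port avoids that race when it matters.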
Tokenize¶
| Key | Type | Default | Description |
|---|---|---|---|
| enabled | bool | true | Skip tokenization when false |
| spacy_model | string | "en_core_web_md" | spaCy model for lemmatization |
| batch_size | int | 512 | Documents per spaCy batch |
| n_process | int | 1 | Parallel spaCy processes |
| disable | list | ["ner", "parser", "textcat"] | spaCy pipeline components to skip |
| fallback_model | string | "en_core_web_md" | Fallback if primary model is unavailable |
| auto_download | bool | true | Auto-download the spaCy model if missing |
Author Disambiguation¶
| Key | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable author name disambiguation with ads-and |
| backend | string | "local" | AND backend: local or modal |
| runtime | string | "auto" | Local AND runtime: auto, cpu, or gpu; Modal resolves to GPU |
| modal_gpu | string \| null | "l4" | Modal GPU type when backend=modal: l4 or t4 |
| model_bundle | string \| null | null | Advanced override for a disambiguation model bundle; null uses the packaged fixed bundle |
| dataset_id | string \| null | null | Dataset identifier for the AND package |
| force_refresh | bool | false | Re-run disambiguation even if cached results exist |
| infer_stage | string | "full" | Inference stage: full, incremental, or smaller ads-and stages such as smoke |
enabled=true without further settings runs locally and does not start Modal.
CPU is useful for small checks; use local GPU or Modal for larger corpora.
Topic Model¶
Core¶
| Key | Type | Default | Preset Override | Description |
|---|---|---|---|---|
| sample_size | int \| null | null | — | Random subset size for exploration; null uses all documents |
| backend | string | "bertopic" | openrouter → toponymy | bertopic (flat topic set) or toponymy (hierarchical layers) |
| clustering_method | string | "fast_hdbscan" | — | HDBSCAN implementation; "hdbscan" for hierarchy analysis |
| outlier_threshold | float | 0.5 | — | Probability threshold for outlier reassignment (BERTopic) |
| min_df | int \| null | null | all presets → 3 | Minimum document frequency for topic terms; null enables auto-scaling as max(1, min(5, n_docs // 100)) |
Embeddings¶
| Key | Type | Default | Description |
|---|---|---|---|
| embedding_provider | string | varies | local, openrouter, or huggingface_api |
| embedding_model | string | varies | Model identifier (HF name or OpenRouter name) |
| embedding_api_key | string \| null | null | API key override for embedding provider |
| embedding_batch_size | int | 96 | Documents per embedding batch |
| embedding_max_workers | int | 20 | Concurrent embedding requests |
Dimensionality Reduction¶
| Key | Type | Default | Description |
|---|---|---|---|
| reduction_method | string | "pacmap" | pacmap or umap |
| params_5d | dict | see below | Parameters for the 5D clustering reduction |
| params_2d | dict | see below | Parameters for the 2D visualization reduction |
Default params_5d and params_2d used by the official presets:
params_5d:
  n_neighbors: 30
  metric: angular
  random_state: 42
params_2d:
  n_neighbors: 30
  metric: angular
  random_state: 42
LLM Labeling¶
| Key | Type | Default | Description |
|---|---|---|---|
| llm_provider | string | varies | openrouter, llama_server, huggingface_api, or local |
| llm_model | string \| null | varies | Model identifier for openrouter, huggingface_api, or local Transformers labeling |
| llm_model_repo | string \| null | null | HF repo for GGUF download (llama_server) |
| llm_model_file | string \| null | null | Filename within the repo (llama_server) |
| llm_model_path | string \| null | null | Explicit local GGUF path (llama_server) |
| llm_api_key | string \| null | null | API key override for LLM provider |
| llm_prompt_name | string | "physics" | Named topic-label prompt/instruction set: physics or generic |
| llm_prompt | string \| null | null | Custom BERTopic prompt override or extra Toponymy naming instructions |
| bertopic_label_max_tokens | int | 128 | Max tokens for BERTopic topic labels |
BERTopic-Specific¶
| Key | Type | Default | Description |
|---|---|---|---|
| cluster_params | dict | see below | HDBSCAN parameters |
| pipeline_models | list | ["POS", "KeyBERT", "MMR"] | Sequential representation refinement pipeline |
| parallel_models | list | ["MMR", "POS", "KeyBERT"] | Parallel comparison representations |
Default cluster_params:
cluster_params:
  min_cluster_size: 15
  min_samples: 3
  cluster_selection_method: eom
  cluster_selection_epsilon: 0.05
Toponymy-Specific¶
| Key | Type | Default | Description |
|---|---|---|---|
| toponymy_cluster_params | dict | {} | Toponymy clusterer overrides (min_clusters, base_min_cluster_size, etc.) |
| toponymy_layer_index | string \| int | "auto" | Working-layer selector; auto picks the coarsest layer |
| toponymy_local_label_max_tokens | int | 128 | Max tokens for local Toponymy labels |
| toponymy_embedding_model | string \| null | null | Toponymy-internal embedding model; falls back to main embedding model |
| toponymy_embedding_batch_size | int | 96 | Batch size for API-based Toponymy-internal embedding calls |
| toponymy_max_workers | int | 10 | Concurrent labeling/embedding requests |
Toponymy is validated with toponymy==0.4.0 and fast-hdbscan>=0.2.2,<0.3.
The OpenRouter preset pins toponymy_embedding_model to qwen/qwen3-embedding-8b
so swapping the main document embedding model does not accidentally make
Toponymy keyphrase/name embeddings slower or more expensive. The
toponymy_max_workers setting controls concurrent labeling and embedding
requests; it does not change the internal HDBSCAN/Boruvka thread count.
Visualization¶
| Key | Type | Default | Description |
|---|---|---|---|
| enabled | bool | true | Set false to skip HTML map generation |
| title | string | "ADS Topic Map" | Map title rendered above the canvas; supports {topic_count} and {document_count} when you want counts in the heading |
| subtitle_template | string | "{topic_count} topics from {document_count:,} ADS records" | Subtitle template; supports {topic_count} and {document_count} |
| dark_mode | bool | true | Dark or light UI theme |
| font_family | string | "Cinzel" | Google/system font for labels and titles |
| topic_tree | bool | false | Expert-mode toggle for an extra hierarchy tree panel (Toponymy only) |
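The placeholders in title and subtitle_template follow Python's format-string syntax, including the :, thousands separator used in the default subtitle. A minimal sketch of how such a template expands, assuming it is rendered with str.format (the counts are made-up example values):

```python
template = "{topic_count} topics from {document_count:,} ADS records"

# Example values only; a real run would supply its own counts.
subtitle = template.format(topic_count=12, document_count=3456)
print(subtitle)
```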
Curation¶
| Key | Type | Default | Description |
|---|---|---|---|
| clusters_to_remove | list | [] | Flat cluster list to discard (BERTopic; also Toponymy's selected working layer) |
| layered_clusters_to_remove | list | [] | Layered cluster list for Toponymy: one or more {layer, cluster_id} mappings |
Use the simplest field that identifies the clusters you inspected:
| Use case | Setting |
|---|---|
| BERTopic or another flat topic model | clusters_to_remove |
| Toponymy, removing clusters from the selected working layer | clusters_to_remove |
| Toponymy, removing clusters from explicit hierarchy layers | layered_clusters_to_remove |
Examples:
# BERTopic: remove clusters 3 and 4
curation:
  clusters_to_remove: [3, 4]

# Toponymy: remove noise from layer 1 and cluster 12 from layer 0
curation:
  layered_clusters_to_remove:
    - layer: 1
      cluster_id: -1
    - layer: 0
      cluster_id: 12
clusters_to_remove is always a list. For one cluster, use
clusters_to_remove: [7], not clusters_to_remove: 7. Cluster IDs are
run-local, so inspect a completed run first and apply removals with a variant
run from that same run.
layered_clusters_to_remove is also a list, but each list item is a mapping
with both layer and cluster_id. Multiple selections are combined: a document
is removed when it matches any selection. Coarser Toponymy layers can remove
more documents than finer layers because their clusters group broader branches
of the hierarchy.
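The "matches any selection" rule for layered removals can be sketched as follows. The document representation (a mapping from layer index to cluster id) and the helper name is_removed are assumptions for illustration, not the ads-bib data model.

```python
def is_removed(doc_layers, selections):
    """Return True when the document matches at least one {layer, cluster_id} selection.

    doc_layers maps a layer index to the document's cluster id on that layer.
    """
    return any(doc_layers.get(sel["layer"]) == sel["cluster_id"] for sel in selections)


# Hypothetical per-document cluster assignments on two Toponymy layers.
docs = {
    "a": {0: 12, 1: 4},   # cluster 12 on layer 0
    "b": {0: 3, 1: -1},   # noise (-1) on layer 1
    "c": {0: 5, 1: 4},    # matches neither selection
}
selections = [
    {"layer": 1, "cluster_id": -1},
    {"layer": 0, "cluster_id": 12},
]

kept = [doc_id for doc_id, layers in docs.items() if not is_removed(layers, selections)]
print(kept)
```

Only document c survives here, because a and b each match one of the two selections.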
For CLI overrides, quote the whole value. The quotes protect the YAML-style list or mapping from your shell; they are not part of the value:
ads-bib run --from-run <run_id> --set 'curation.clusters_to_remove=[7, 12]'
ads-bib run --from-run <run_id> --set 'curation.layered_clusters_to_remove=[{layer: 0, cluster_id: 12}, {layer: 1, cluster_id: 20}]'
Citations¶
| Key | Type | Default | Preset Override | Description |
|---|---|---|---|---|
| metrics | list | ["direct", "co_citation", "bibliographic_coupling", "author_co_citation"] | — | Network types to build |
| min_counts | dict | {direct: 1, co_citation: 1, bibliographic_coupling: 1, author_co_citation: 1} | all presets → {direct: 2, co_citation: 3, bibliographic_coupling: 2, author_co_citation: 3} | Minimum edge weight per metric |
| authors_filter | list[string] \| null | null | — | Optional string-based include filter on source publications (Author) |
| authors_filter_uids | list[string] \| null | null | — | Optional UID-based include filter on source publications (author_uids); requires author disambiguation output in memory |
| cited_authors_exclude | list[string] \| null | null | — | Optional string-based exclude filter on cited references (Author); matching references are pruned before network construction |
| cited_author_uids_exclude | list[string] \| null | null | — | Optional UID-based exclude filter on cited references (author_uids); requires author disambiguation output in memory |
| output_format | string | "gexf" | — | Export format: gexf, graphology, csv, or all |
The code default is 1 for every metric, so every edge is kept. The four
packaged presets raise those thresholds to practical starter values (direct 2,
co_citation 3, bibliographic_coupling 2, author_co_citation 3) so sparse
author-focused corpora still retain usable structure. Override a single metric
via citations.min_counts.<metric>.
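The effect of a min_counts threshold can be sketched as a simple edge filter. The edge representation and the helper filter_edges are hypothetical, and the sketch assumes the threshold is inclusive (an edge with weight equal to the minimum is kept), which is consistent with the code default of 1 keeping every edge.

```python
# Thresholds used by the packaged presets, as listed in the table above.
preset_min_counts = {
    "direct": 2,
    "co_citation": 3,
    "bibliographic_coupling": 2,
    "author_co_citation": 3,
}


def filter_edges(edges, min_count):
    """Keep edges whose weight is at least min_count (inclusive, by assumption)."""
    return {pair: weight for pair, weight in edges.items() if weight >= min_count}


# Made-up co-citation edge weights for three references.
co_citation_edges = {("A", "B"): 1, ("A", "C"): 3, ("B", "C"): 5}
kept = filter_edges(co_citation_edges, preset_min_counts["co_citation"])
print(kept)
```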
authors_filter and authors_filter_uids act on the source publication set.
cited_authors_exclude and cited_author_uids_exclude act on the cited reference side by removing
matching references from each publication before the direct, co-citation,
bibliographic-coupling, and author-co-citation networks are computed.
For gexf, graphology, and network CSV exports, direct is exported as a
directed graph. co_citation, bibliographic_coupling, and
author_co_citation are exported as undirected weighted graphs. When a metric
has richer edge provenance than the exported graph can carry compactly,
ads-bib also writes a CSV evidence sidecar.
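To make the undirected weighted metrics concrete, the sketch below computes co-citation and bibliographic-coupling weights from reference lists using their textbook definitions. It is not the ads-bib implementation; the input structure and function names are assumptions.

```python
from collections import Counter
from itertools import combinations

# Made-up data: each publication maps to the references it cites.
references = {
    "P1": ["R1", "R2", "R3"],
    "P2": ["R2", "R3"],
    "P3": ["R3"],
}


def co_citation(refs_by_pub):
    """Weight of (ref_a, ref_b) = number of publications citing both."""
    counts = Counter()
    for refs in refs_by_pub.values():
        for a, b in combinations(sorted(set(refs)), 2):
            counts[(a, b)] += 1
    return counts


def bibliographic_coupling(refs_by_pub):
    """Weight of (pub_a, pub_b) = number of references they share."""
    counts = Counter()
    for a, b in combinations(sorted(refs_by_pub), 2):
        shared = len(set(refs_by_pub[a]) & set(refs_by_pub[b]))
        if shared:
            counts[(a, b)] = shared
    return counts


cc = co_citation(references)
bc = bibliographic_coupling(references)
print(dict(cc))
print(dict(bc))
```

Both results are symmetric pair weights, which is why these metrics export as undirected graphs, while a direct edge always points from the citing publication to the cited reference.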
CLI Overrides¶
ads-bib run --config <file> --from <stage> --to <stage>
ads-bib run --config <file> --run-name <name>
ads-bib run --config <file> --set key.subkey=value
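The key.subkey=value form addresses a nested config key by its dotted path. A minimal sketch of that idea on a plain dict follows. apply_override is a hypothetical helper, and it keeps the value as a raw string, whereas the real CLI also accepts YAML-style lists and mappings.

```python
def apply_override(config, assignment):
    """Apply one 'dotted.key=value' override to a nested dict, in place."""
    # Split on the first '=' only, so values may themselves contain '='.
    dotted_key, raw_value = assignment.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    node = config
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = raw_value
    return config


config = {"search": {"query": None}, "topic_model": {"backend": "bertopic"}}
apply_override(config, 'search.query=author:"Hawking, S*"')
apply_override(config, "topic_model.embedding_model=google/gemini-embedding-001")
print(config)
```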
Scaling Formulas¶
These formulas auto-scale parameters based on corpus size. See the Pipeline Guide for when and why to override them.
| Parameter | Formula | Notes |
|---|---|---|
| min_cluster_size | max(15, n_docs * 0.001) | ~0.1% of documents as minimum cluster |
| min_df | max(1, min(5, n_docs // 100)) | Suppresses noise terms in larger corpora |
| n_neighbors | 15 to 80 | Higher for larger datasets |
| min_counts (citations) | Scale proportionally | Keeps networks readable |
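The two closed-form rules in the table can be checked quickly in Python. The function names are hypothetical, and flooring n_docs * 0.001 to an integer (written here as n_docs // 1000) is an assumption, since the table does not say how the fractional value is rounded.

```python
def auto_min_cluster_size(n_docs: int) -> int:
    """max(15, n_docs * 0.001), floored to an integer (rounding is an assumption)."""
    return max(15, n_docs // 1000)


def auto_min_df(n_docs: int) -> int:
    """max(1, min(5, n_docs // 100)), as given in the table."""
    return max(1, min(5, n_docs // 100))


for n_docs in (50, 300, 10_000, 50_000):
    print(n_docs, auto_min_cluster_size(n_docs), auto_min_df(n_docs))
```

Under these rules min_df saturates at 5 from 500 documents upward, and min_cluster_size stays at 15 until the corpus exceeds roughly 15,000 documents.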
Secrets¶
Keep API keys in .env only. Never commit them to notebook cells or YAML
configs.
| Variable | Required when |
|---|---|
| ADS_TOKEN | Always |
| OPENROUTER_API_KEY | Using openrouter providers |
| HF_TOKEN | Using huggingface_api providers or local_gpu model access |
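The table can be turned into a small check of which variables a given setup needs. This is only a sketch of the rules as listed: the helper names are hypothetical, the environment is passed in as a plain mapping, and the built-in doctor command remains the supported way to run the full preflight.

```python
def required_env_vars(providers, preset=None):
    """Return the env vars the table above marks as required for this setup."""
    required = {"ADS_TOKEN"}  # listed as always required
    if "openrouter" in providers:
        required.add("OPENROUTER_API_KEY")
    if "huggingface_api" in providers or preset == "local_gpu":
        required.add("HF_TOKEN")
    return required


def missing_env_vars(required, environ):
    """Return the required variables that are unset or empty, sorted by name."""
    return sorted(name for name in required if not environ.get(name))


# Example environment as a plain dict; pass os.environ in real use.
example_env = {"ADS_TOKEN": "example-token"}
needed = required_env_vars(providers={"openrouter"})
missing = missing_env_vars(needed, example_env)
print(missing)
```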
Read next¶
- Pipeline Guide — when to retune which stage
- Troubleshooting — if runs exit early
- Output Artifacts — what each file in runs/<run_id>/ is for