Configuration

Complete reference of all configuration keys. For explanations and tuning guidance, see the Pipeline Guide.

CLI Presets

The primary runtime path is the CLI. ads-bib ships four official packaged starter presets; the examples below use openrouter:

ads-bib run --preset openrouter --set search.query='author:"Hawking, S*"'
ads-bib preset write openrouter --output ads-bib.yaml
ads-bib doctor --preset openrouter --set search.query='author:"Hawking, S*"'

From Python, you can use the same packaged presets as the CLI with ads_bib.run(...):

import ads_bib

ads_bib.run(
    preset="openrouter",
    query='author:"Hawking, S*"',
)

Each preset defines one runtime road. Presets are generic starter configs, so you must set search.query before running. ads-bib run is the usual entry point; preset write is optional and produces a single editable YAML file; doctor is the support command that prints the full preflight report without starting a run.

For the provider stack, default backend, and hardware requirements of each preset, see Runtime Roads. One install covers all four presets; the preset switch only changes providers and defaults at runtime.

Install

Use a Python 3.12 env. One base install covers every preset:

uv pip install ads-bib

On NVIDIA / CUDA machines, add the validated CUDA Torch wheel into the same env so local_gpu runs on the GPU:

uv pip install ads-bib "torch==2.6.0" --extra-index-url https://download.pytorch.org/whl/cu124

If you need to restore the validated CPU wheel explicitly for a local CPU env:

uv pip install "torch==2.6.0" --extra-index-url https://download.pytorch.org/whl/cpu

The validated local HF stack for this release is Torch 2.6.x with Transformers 4.56.x.

Optional algorithm extras are available when you intentionally switch defaults:

uv pip install "ads-bib[umap]"
uv pip install "ads-bib[hdbscan]"

See Install & First Run for the full first-run walk-through.

Start a variant from a previous run

Every completed run saves the resolved configuration and reusable stage artifacts. Use --from-run when you want to change one key and keep the rest:

ads-bib run --from-run run_20260407_120000_ads_bib_openrouter \
  --set topic_model.embedding_model=google/gemini-embedding-001

ads-bib loads config_used.yaml, applies the override, picks the first affected stage, and writes a new run folder under runs/. Use --dry-run to see the changed keys and reused/recomputed stages before creating the variant.

Unless stated otherwise, the tables below describe the raw code defaults. The Preset Override column shows the value used by the four packaged starter presets when they deviate from the code default. Inspect src/ads_bib/_presets/*.yaml or write a preset locally with ads-bib preset write ... when you need the full road-specific starter config.

Notebook Section Dicts

The GitHub notebook uses ten inline configuration dicts:

RUN, SEARCH, TRANSLATE, LLAMA_SERVER, TOKENIZE, AUTHOR_DISAMBIGUATION, TOPIC_MODEL, VISUALIZATION, CURATION, CITATIONS

Each dict is passed to session.set_section(...). The keys below map directly to notebook dict keys and YAML config keys.


Run

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| run_name | string | "ADS_Curation_Run" | Identifier appended to the timestamped run directory name |
| start_stage | string | "search" | First stage to run. Used from YAML or PipelineConfig when you do not pass a start stage from Python/CLI; CLI --from and ads_bib.run(start_stage=...) override it |
| stop_stage | string \| null | null | Last stage to run; null runs to the end. CLI --to and ads_bib.run(stop_stage=...) override it |
| random_seed | int | 42 | Seed for reproducible reductions and clustering |
| openrouter_cost_mode | string | "hybrid" | OpenRouter cost resolution: "hybrid" combines live usage with a pricing lookup (default), "strict" fails fast when cost data is incomplete, "fast" skips the extra lookup and trusts the streaming usage payload |
| project_root | string \| null | null | Project folder for shared data/cache/ and run outputs under runs/; defaults to the current working directory |
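
A minimal run section sketch (the lowercase section name mirrors the curation: example later on this page); run_name is illustrative, the rest are the code defaults:

run:
  run_name: "Hawking_Curation"    # illustrative; appended to the timestamped run directory name
  start_stage: search
  stop_stage: null                # run to the end
  random_seed: 42
  openrouter_cost_mode: hybrid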

Search

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| query | string | — | ADS search query (syntax reference); must be set before running |
| ads_token | string \| null | null | ADS API token; falls back to the ADS_TOKEN env var |
| refresh_search | bool | true | Re-run the ADS query (set false to reuse cached bibcodes) |
| refresh_export | bool | true | Re-resolve bibcodes to metadata (set false to reuse the cached export) |

Example query compositions:

# Simple author query
query: 'author:"Hawking, S*"'

# Author + topic filter
query: '(author:"Hawking, S*") AND abs:"black hole"'

# Seed + forward citations
query: 'author:"Hawking, S*" OR citations(author:"Hawking, S*")'

Translate

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | true | Skip translation when false |
| provider | string | varies | openrouter, nllb, llama_server, huggingface_api, or transformers |
| model | string \| null | varies | Model identifier for openrouter / huggingface_api / transformers, or an HF repo id / local path for nllb |
| model_repo | string \| null | null | HF repo for GGUF model download (llama_server provider) |
| model_file | string \| null | null | Filename within the repo (llama_server provider) |
| model_path | string \| null | null | Explicit local path to a GGUF file (llama_server provider) |
| api_key | string \| null | null | Provider API key; falls back to an env var |
| max_workers | int | 10 | Concurrent translation requests for remote providers; local transformers translation currently runs sequentially |
| max_tokens | int | 2048 | Maximum tokens per translation request |
| fasttext_model | string \| null | null | Path to the fastText language detection model; packaged presets set data/models/lid.176.bin |
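
A minimal translate sketch for a remote provider; the model id is a placeholder, not a packaged default:

translate:
  enabled: true
  provider: openrouter
  model: vendor/model-id                    # placeholder; choose a model your road supports
  max_workers: 10
  fasttext_model: data/models/lid.176.bin   # path the packaged presets use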

Llama Server

Shared configuration for pipeline stages that use llama_server as provider. This is the default local labeling path for local_cpu and an optional local labeling path for local_gpu.

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| command | string | "llama-server" | Default package-managed command token. With the default value, ads-bib tries PATH, then the managed cache, then an on-demand managed runtime download; set an explicit path or custom command only to override that behavior |
| host | string | "127.0.0.1" | Bind address |
| port | int \| null | null | Port; null auto-selects a free port |
| threads | int \| null | null | CPU threads; null uses the system default |
| ctx_size | int | 4096 | Context window size in tokens |
| gpu_layers | int | -1 | GPU layers to offload; -1 = GPU road default, 0 = CPU-managed local road. With the default command "llama-server", a PATH-resolved runtime may still be probed with -1 first and fall back to 0 automatically |
| startup_timeout_s | float | 120.0 | Seconds to wait for the server to become ready |
| reasoning | string | "off" | Reasoning mode; "off" for standard inference |
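
A minimal llama_server sketch for the CPU-managed local road, using the defaults above. Note the quotes around off: depending on the YAML loader, a bare off can parse as boolean false:

llama_server:
  command: llama-server    # default token: PATH, then managed cache, then managed download
  port: null               # auto-select a free port
  ctx_size: 4096
  gpu_layers: 0            # CPU-managed local road
  startup_timeout_s: 120.0
  reasoning: "off"         # quoted so it stays a string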

Tokenize

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | true | Skip tokenization when false |
| spacy_model | string | "en_core_web_md" | spaCy model for lemmatization |
| batch_size | int | 512 | Documents per spaCy batch |
| n_process | int | 1 | Parallel spaCy processes |
| disable | list | ["ner", "parser", "textcat"] | spaCy pipeline components to skip |
| fallback_model | string | "en_core_web_md" | Fallback if the primary model is unavailable |
| auto_download | bool | true | Auto-download the spaCy model if missing |
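
A tokenize sketch with the code defaults; n_process can be raised on multi-core machines:

tokenize:
  enabled: true
  spacy_model: en_core_web_md
  batch_size: 512
  n_process: 1                      # parallel spaCy processes
  disable: [ner, parser, textcat]
  auto_download: true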

Author Disambiguation

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | false | Enable author name disambiguation with ads-and |
| backend | string | "local" | AND backend: local or modal |
| runtime | string | "auto" | Local AND runtime: auto, cpu, or gpu; Modal resolves to GPU |
| modal_gpu | string \| null | "l4" | Modal GPU type when backend=modal: l4 or t4 |
| model_bundle | string \| null | null | Advanced override for a disambiguation model bundle; null uses the packaged fixed bundle |
| dataset_id | string \| null | null | Dataset identifier for the AND package |
| force_refresh | bool | false | Re-run disambiguation even if cached results exist |
| infer_stage | string | "full" | Inference stage: full, incremental, or smaller ads-and stages such as smoke |

enabled=true without further settings runs locally and does not start Modal. CPU is useful for small checks; use local GPU or Modal for larger corpora.
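
A sketch that opts into Modal with the default GPU type from the table:

author_disambiguation:
  enabled: true
  backend: modal     # "local" is the default and never starts Modal
  modal_gpu: l4      # or t4
  infer_stage: full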

Topic Model

Core

| Key | Type | Default | Preset Override | Description |
| --- | --- | --- | --- | --- |
| sample_size | int \| null | null | | Random subset size for exploration; null uses all documents |
| backend | string | "bertopic" | openrouter → toponymy | bertopic (flat topic set) or toponymy (hierarchical layers) |
| clustering_method | string | "fast_hdbscan" | | HDBSCAN implementation; "hdbscan" for hierarchy analysis |
| outlier_threshold | float | 0.5 | | Probability threshold for outlier reassignment (BERTopic) |
| min_df | int \| null | null | all presets → 3 | Minimum document frequency for topic terms; null enables auto-scaling as max(1, min(5, n_docs // 100)) |

Embeddings

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| embedding_provider | string | varies | local, openrouter, or huggingface_api |
| embedding_model | string | varies | Model identifier (HF name or OpenRouter name) |
| embedding_api_key | string \| null | null | API key override for the embedding provider |
| embedding_batch_size | int | 96 | Documents per embedding batch |
| embedding_max_workers | int | 20 | Concurrent embedding requests |
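
A sketch switching document embeddings to a remote provider, reusing the model id from the variant example earlier on this page:

topic_model:
  embedding_provider: openrouter
  embedding_model: google/gemini-embedding-001
  embedding_batch_size: 96
  embedding_max_workers: 20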

Dimensionality Reduction

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| reduction_method | string | "pacmap" | pacmap or umap |
| params_5d | dict | see below | Parameters for the 5D clustering reduction |
| params_2d | dict | see below | Parameters for the 2D visualization reduction |

Default params_5d and params_2d used by the official presets:

params_5d:
  n_neighbors: 30
  metric: angular
  random_state: 42
params_2d:
  n_neighbors: 30
  metric: angular
  random_state: 42

LLM Labeling

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| llm_provider | string | varies | openrouter, llama_server, huggingface_api, or local |
| llm_model | string \| null | varies | Model identifier for openrouter/huggingface_api |
| llm_model_repo | string \| null | null | HF repo for GGUF download (llama_server) |
| llm_model_file | string \| null | null | Filename within the repo (llama_server) |
| llm_model_path | string \| null | null | Explicit local GGUF path (llama_server) |
| llm_api_key | string \| null | null | API key override for the LLM provider |
| llm_prompt_name | string | "physics" | Named topic-label prompt/instruction set: physics or generic |
| llm_prompt | string \| null | null | Custom BERTopic prompt override or extra Toponymy naming instructions |
| bertopic_label_max_tokens | int | 128 | Max tokens for BERTopic topic labels |
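
A labeling sketch for the openrouter path; the model id is a placeholder:

topic_model:
  llm_provider: openrouter
  llm_model: vendor/model-id        # placeholder; set your labeling model
  llm_prompt_name: physics          # or generic
  bertopic_label_max_tokens: 128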

BERTopic-Specific

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| cluster_params | dict | see below | HDBSCAN parameters |
| pipeline_models | list | ["POS", "KeyBERT", "MMR"] | Sequential representation refinement pipeline |
| parallel_models | list | ["MMR", "POS", "KeyBERT"] | Parallel comparison representations |

Default cluster_params:

cluster_params:
  min_cluster_size: 15
  min_samples: 3
  cluster_selection_method: eom
  cluster_selection_epsilon: 0.05

Toponymy-Specific

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| toponymy_cluster_params | dict | {} | Toponymy clusterer overrides (min_clusters, base_min_cluster_size, etc.) |
| toponymy_layer_index | string \| int | "auto" | Working-layer selector; auto picks the coarsest layer |
| toponymy_local_label_max_tokens | int | 128 | Max tokens for local Toponymy labels |
| toponymy_embedding_model | string \| null | null | Toponymy-internal embedding model; falls back to the main embedding model |
| toponymy_embedding_batch_size | int | 96 | Batch size for API-based Toponymy-internal embedding calls |
| toponymy_max_workers | int | 10 | Concurrent labeling/embedding requests |

Toponymy is validated with toponymy==0.4.0 and fast-hdbscan>=0.2.2,<0.3. The OpenRouter preset pins toponymy_embedding_model to qwen/qwen3-embedding-8b so swapping the main document embedding model does not accidentally make Toponymy keyphrase/name embeddings slower or more expensive. The toponymy_max_workers setting controls concurrent labeling and embedding requests; it does not change the internal HDBSCAN/Boruvka thread count.
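
A Toponymy sketch mirroring the preset pin described above:

topic_model:
  backend: toponymy
  toponymy_embedding_model: qwen/qwen3-embedding-8b   # OpenRouter preset pin
  toponymy_layer_index: auto
  toponymy_max_workers: 10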

Visualization

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | true | Set false to skip HTML map generation |
| title | string | "ADS Topic Map" | Map title rendered above the canvas; supports {topic_count} and {document_count} when you want counts in the heading |
| subtitle_template | string | "{topic_count} topics from {document_count:,} ADS records" | Subtitle template; supports {topic_count} and {document_count} |
| dark_mode | bool | true | Dark or light UI theme |
| font_family | string | "Cinzel" | Google/system font for labels and titles |
| topic_tree | bool | false | Expert-mode toggle for an extra hierarchy tree panel (Toponymy only) |
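
A visualization sketch spelling out the default templates from the table:

visualization:
  enabled: true
  title: "ADS Topic Map"
  subtitle_template: "{topic_count} topics from {document_count:,} ADS records"
  dark_mode: true
  font_family: Cinzel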

Curation

| Key | Type | Default | Description |
| --- | --- | --- | --- |
| cluster_targets | list | [] | Hierarchy-aware removals: [{layer: <int>, cluster_id: <int>}] (Toponymy) |
| clusters_to_remove | list | [] | Flat cluster IDs to discard (BERTopic; also works for the Toponymy working layer) |

Example:

# BERTopic: remove clusters 3 and 4
curation:
  clusters_to_remove: [3, 4]

# Toponymy: remove noise from layer 1 and cluster 12 from layer 0
curation:
  cluster_targets:
    - layer: 1
      cluster_id: -1
    - layer: 0
      cluster_id: 12

Citations

| Key | Type | Default | Preset Override | Description |
| --- | --- | --- | --- | --- |
| metrics | list | ["direct", "co_citation", "bibliographic_coupling", "author_co_citation"] | | Network types to build |
| min_counts | dict | {direct: 1, co_citation: 1, bibliographic_coupling: 1, author_co_citation: 1} | all presets → {direct: 2, co_citation: 3, bibliographic_coupling: 2, author_co_citation: 3} | Minimum edge weight per metric |
| authors_filter | list[string] \| null | null | | Optional string-based include filter on source publications (Author) |
| authors_filter_uids | list[string] \| null | null | | Optional UID-based include filter on source publications (author_uids); requires author disambiguation output in memory |
| cited_authors_exclude | list[string] \| null | null | | Optional string-based exclude filter on cited references (Author); matching references are pruned before network construction |
| cited_author_uids_exclude | list[string] \| null | null | | Optional UID-based exclude filter on cited references (author_uids); requires author disambiguation output in memory |
| output_format | string | "gexf" | | Export format: gexf, graphology, csv, or all |

The code default is 1 for every metric, so every edge is kept. The four packaged presets raise those thresholds to practical starter values (2/3/2/3) so sparse author-focused corpora still retain usable structure. Override per metric via citations.min_counts.<metric>.

authors_filter and authors_filter_uids act on the source publication set. cited_authors_exclude and cited_author_uids_exclude act on the cited reference side by removing matching references from each publication before the direct, co-citation, bibliographic-coupling, and author-co-citation networks are computed.

For gexf, graphology, and network CSV exports, direct is exported as a directed graph. co_citation, bibliographic_coupling, and author_co_citation are exported as undirected weighted graphs. When a metric has richer edge provenance than the exported graph can carry compactly, ads-bib also writes a CSV evidence sidecar.
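
A citations sketch combining the preset thresholds with a string-based source filter; the author name is illustrative:

citations:
  metrics: [direct, co_citation, bibliographic_coupling, author_co_citation]
  min_counts:
    direct: 2
    co_citation: 3
    bibliographic_coupling: 2
    author_co_citation: 3
  authors_filter: ["Hawking, S"]    # illustrative include filter on source publications
  output_format: gexf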

CLI Overrides

ads-bib run --config <file> --from <stage> --to <stage>
ads-bib run --config <file> --run-name <name>
ads-bib run --config <file> --set key.subkey=value

Scaling Formulas

These formulas auto-scale parameters based on corpus size. See the Pipeline Guide for when and why to override them.

| Parameter | Formula | Notes |
| --- | --- | --- |
| min_cluster_size | max(15, n_docs * 0.001) | ~0.1% of documents as the minimum cluster size |
| min_df | max(1, min(5, n_docs // 100)) | Suppresses noise terms in larger corpora |
| n_neighbors | 15-80 | Higher for larger datasets |
| min_counts (citations) | scale proportionally | Keeps networks readable |
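
As a worked example, at n_docs = 20,000 these give min_cluster_size = max(15, 20000 * 0.001) = 20 and min_df = max(1, min(5, 20000 // 100)) = 5.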

Secrets

Keep API keys in .env only. Never commit them to notebook cells or YAML configs.

| Variable | Required when |
| --- | --- |
| ADS_TOKEN | Always |
| OPENROUTER_API_KEY | Using openrouter providers |
| HF_TOKEN | Using huggingface_api providers (HF_API_KEY and HUGGINGFACE_API_KEY are also accepted) |
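
A minimal .env sketch; placeholders only, never commit real values:

ADS_TOKEN=...
OPENROUTER_API_KEY=...
HF_TOKEN=...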