Runtime Roads

ads-bib ships four official runtime roads. They share the same package install and differ only in provider keys, hardware, and the preset you pick. Author name disambiguation is a separate optional stage inside every road: enabling it with enabled=true uses the default local (auto) backend, while backend=modal must be selected explicitly.

Pick a road

openrouter  — you are fine with a paid (credit-card) provider and want minimal local setup
hf_api      — your team already has a Hugging Face token and model workflow
local_cpu   — you want an offline-friendly run on a CPU-only machine
local_gpu   — you have an NVIDIA / CUDA GPU and want local acceleration

Road Matrix

Road        Hardware        Network          Cost model          Default backend
openrouter  any             API calls        pay-per-token       toponymy
hf_api      any             API calls        HF-plan-dependent   bertopic
local_cpu   CPU only        model downloads  none after setup    bertopic
local_gpu   NVIDIA + CUDA   model downloads  none after setup    bertopic

Stack by road (translation / embeddings / labeling):

  • openrouter — OpenRouter chat model; OpenRouter embedding model; OpenRouter LLM.
  • hf_api — Hugging Face Inference API for all three.
  • local_cpu — NLLB via CTranslate2; SentenceTransformers; GGUF via llama_server.
  • local_gpu — TranslateGemma via transformers; SentenceTransformers; local transformers (optional GGUF via llama_server if you change topic_model.llm_provider).

The happy path is always:

uv pip install ads-bib
ads-bib run --preset <road> --set search.query='author:"Hawking, S*"'

For local_gpu on NVIDIA / CUDA, also install the validated CUDA Torch wheel as described in Install & First Run.
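
In practice that usually means installing Torch from a CUDA wheel index before the first run. A minimal sketch, assuming the CUDA 12.4 index that matches the PyTorch stack listed for local_gpu below; the exact validated wheel is the one documented in Install & First Run:

uv pip install torch --index-url https://download.pytorch.org/whl/cu124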

Author Disambiguation Backend

Author name disambiguation (AND) uses ads-and and is disabled by default in every preset. When enabled, the default path is local and cost-free:

ads-bib run --preset openrouter \
  --set search.query='author:"Hawking, S*"' \
  --set author_disambiguation.enabled=true

For larger runs without a local GPU, use Modal explicitly:

ads-bib run --preset openrouter \
  --set search.query='author:"Hawking, S*"' \
  --set author_disambiguation.enabled=true \
  --set author_disambiguation.backend=modal

Modal requires MODAL_TOKEN_ID and MODAL_TOKEN_SECRET in the environment or project .env.
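
A minimal .env sketch (both values are placeholders, not real tokens):

MODAL_TOKEN_ID=<your-modal-token-id>
MODAL_TOKEN_SECRET=<your-modal-token-secret>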

openrouter

Smallest local footprint. Good default for the first remote run.

  • Keys: ADS_TOKEN, OPENROUTER_API_KEY
  • Hardware: any machine that can run the Python package
  • Defaults:
    • translation: remote OpenRouter chat model
    • document embeddings: remote OpenRouter embedding model
    • Toponymy-internal embeddings: Qwen3 through OpenRouter
    • labeling: remote OpenRouter LLM
    • backend: toponymy
  • Not used: no local model downloads, no llama-server
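
A minimal sketch of a first openrouter run, assuming both keys are supplied through the environment (values shown are placeholders):

export ADS_TOKEN=<your-ads-token>
export OPENROUTER_API_KEY=<your-openrouter-key>
ads-bib run --preset openrouter --set search.query='author:"Hawking, S*"'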

hf_api

One remote provider with Hugging Face model identifiers.

  • Keys: ADS_TOKEN, HF_TOKEN
  • Hardware: any machine that can run the Python package
  • Defaults:
    • translation: Hugging Face Inference API
    • embeddings: Hugging Face Inference API
    • labeling: Hugging Face Inference API
    • backend: bertopic

hf_api supports both the bertopic and toponymy backends, using huggingface_api for embeddings and labeling in either case, so the provider stack stays identical across the two.
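
The same shape of first run applies here, assuming the keys are supplied through the environment (values shown are placeholders):

export ADS_TOKEN=<your-ads-token>
export HF_TOKEN=<your-hf-token>
ads-bib run --preset hf_api --set search.query='author:"Hawking, S*"'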

local_cpu

Local run without requiring CUDA.

  • Keys: ADS_TOKEN
  • Hardware: standard CPU machine
  • Defaults:
    • translation: nllb via CTranslate2
    • embeddings: local SentenceTransformers (google/embeddinggemma-300m)
    • labeling: GGUF via llama_server, preset model mradermacher/Qwen3.5-0.8B-GGUF / Qwen3.5-0.8B.Q4_K_M.gguf
    • backend: bertopic
  • Optional switch: set topic_model.llm_provider=local to use local Transformers labeling instead

The llama-server runtime is package-managed. With llama_server.command: "llama-server", ads-bib resolves the executable from PATH, then from the managed cache under data/models/llama_cpp/, then by downloading the pinned runtime on demand. Set llama_server.command to an explicit path only when you intentionally want to override that.
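
Both the topic_model.llm_provider switch and the llama_server.command override can be set on the run command; the binary path in the second example is purely illustrative:

ads-bib run --preset local_cpu \
  --set search.query='author:"Hawking, S*"' \
  --set topic_model.llm_provider=local

ads-bib run --preset local_cpu \
  --set search.query='author:"Hawking, S*"' \
  --set llama_server.command=/opt/llama.cpp/bin/llama-server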

local_gpu

Local GPU road for machines with a compatible Torch/CUDA stack.

  • Keys: ADS_TOKEN
  • Hardware:
    • NVIDIA / CUDA for the official accelerated path
    • without CUDA, local HF/Torch work falls back to CPU and doctor flags the official GPU road as unsupported
  • Defaults:
    • translation: google/translategemma-4b-it via local transformers
    • embeddings: local SentenceTransformers (google/embeddinggemma-300m)
    • labeling: local transformers with google/gemma-3-1b-it
    • backend: bertopic
  • Optional switch: set topic_model.llm_provider=llama_server to use GGUF labeling instead
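
As on local_cpu, that switch is a single override on the run command:

ads-bib run --preset local_gpu \
  --set search.query='author:"Hawking, S*"' \
  --set topic_model.llm_provider=llama_server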

GPU runtime differs between Windows and Linux

When llama_server is used as the labeling provider on local_gpu, the managed binary is platform-specific:

OS       Managed llama-server build   PyTorch stack
Windows  CUDA 12.4                    CUDA 12.4
Linux    Vulkan                       CUDA 12.4

On Linux, the llama-server binary uses the official llama.cpp Vulkan build, while embeddings and transformers-based translation still run on CUDA via PyTorch. This split is deliberate: Vulkan is the supported distribution path for a prebuilt GPU binary of llama.cpp on Linux, and it works on the same NVIDIA driver stack as CUDA PyTorch. No extra action is required — the right binary is selected automatically.

First-Run Behavior

The first run on a fresh machine or in a fresh environment is usually the slowest. ads-bib run may download or warm:

  • lid.176.bin for fastText language detection
  • the spaCy tokenization model
  • NLLB or TranslateGemma weights
  • SentenceTransformer model weights
  • the package-managed llama-server binary and GGUF weights

Later runs reuse those assets from cache. None of them add a pipeline stage — they only populate the caches for the stages you already asked to run.

If wall-clock time on the first attempt is the problem (not a missing file), see Troubleshooting — First run is slower than expected for the same list in a symptom → fix form.