Python API

Use the Python API when you need ads-bib inside scripts or notebooks. For standard runs, prefer the CLI or the high-level ads_bib.run(...) function.

The CLI (ads-bib run --preset ...) and the high-level Python function (ads_bib.run(...)) share the same preset-driven run path. Use the lower-level APIs when you need a custom PipelineConfig, notebook-driven exploration, or experiments on top of the topic-model primitives.

Pick an Entry Point

Five use cases, five entry points. Pick by what you want to do, then jump to the reference section further down for the signature.

1. Reproducible terminal run

You want to start a run from a preset and have the full artifact tree on disk. Use the CLI: ads-bib run.

2. Programmatic full run from Python

Same goal as (1), but driven from a script, notebook cell, or wrapper function — and you may want the in-memory results back. Use ads_bib.run.

3. Interactive, stage-by-stage exploration

You want to run stages one at a time, inspect intermediate results between stages, and adjust section configs without re-running upstream work. Use NotebookSession.

4. Pre-built config straight into the runner

You already have a fully-built PipelineConfig (for example loaded from runs/<run_id>/config_used.yaml or constructed in code) and want to hand it directly to the pipeline runner. Use PipelineConfig.from_dict followed by run_pipeline.

5. Topic modeling on your own texts

You want to apply the topic-model primitives to arbitrary text, outside the ADS data flow. Use the low-level chain: compute_embeddings → reduce_dimensions → fit_bertopic or fit_toponymy → build_topic_dataframe. Citation networks run independently via process_all_citations.

Stable Top-Level Imports

from ads_bib import (
    run,
    RunBlockedError,
    PipelineConfig,
    NotebookSession,
    run_pipeline,
    compute_embeddings,
    reduce_dimensions,
    fit_bertopic,
    fit_toponymy,
    build_topic_dataframe,
    process_all_citations,
    reduce_outliers,
)

The full export list is in src/ads_bib/__init__.py.

End-to-End Example

The simplest programmatic run mirrors ads-bib run --preset ...:

import ads_bib

ads_bib.run(
    preset="openrouter",
    query='author:"Hawking, S*"',
)

Keep the return value only when you want the in-memory outputs:

result = ads_bib.run(
    preset="openrouter",
    query='author:"Hawking, S*"',
)

print(result.publications.shape)
print(result.topic_df.columns)
print(result.curated_df.head())

# `PipelineContext` fields you will touch most often:
# result.publications, result.refs, result.topic_df, result.curated_df,
# result.citation_results, result.paths, result.config

ads_bib.run creates the shared project data directories and runs/ under project_root, writes config_used.yaml and run_summary.yaml, and persists run artifacts under runs/<run_id>/data/, including stage restart points (search, export, translated, tokenized, …) and final outputs (dataset, citations).

ads_bib.run

Source: src/ads_bib/runner.py

run(
    *,
    preset: str | None = None,
    config: PipelineConfig | Mapping[str, Any] | Path | str | None = None,
    query: str | None = None,
    overrides: Mapping[str, Any] | None = None,
    start_stage: StageName | None = None,
    stop_stage: StageName | None = None,
    run_name: str | None = None,
    project_root: Path | str | None = None,
    preflight: bool = True,
) -> PipelineContext

Use either preset or config. query is a shortcut for search.query, and overrides accepts the same dotted keys as CLI --set:

result = ads_bib.run(
    preset="local_cpu",
    query='author:"Hawking, S*"',
    overrides={"topic_model.backend": "bertopic"},
    start_stage="search",
    stop_stage="citations",
)

With preflight=True, the function performs the same run preflight as the CLI and raises RunBlockedError if required keys, dependencies, or managed runtime preparation block the run. It auto-selects notebook-friendly progress output under Jupyter and CLI-style output in terminal Python runs. The return value is still a PipelineContext, even if the simple examples ignore it.
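If you prefer to handle a blocked run yourself rather than let the exception propagate (for example in a batch script), you can catch RunBlockedError around the call — a minimal sketch, assuming the exception message carries the human-readable reason:

```python
import ads_bib
from ads_bib import RunBlockedError

try:
    result = ads_bib.run(
        preset="openrouter",
        query='author:"Hawking, S*"',
    )
except RunBlockedError as exc:
    # Preflight found missing keys, dependencies, or managed-runtime issues;
    # no pipeline stage has executed at this point.
    print(f"Run blocked: {exc}")
else:
    print(result.publications.shape)
```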

PipelineConfig

Source: src/ads_bib/pipeline.py

PipelineConfig(
    run=RunConfig(),
    search=SearchConfig(),
    translate=TranslateConfig(),
    llama_server=LlamaServerConfig(),
    tokenize=TokenizeConfig(),
    author_disambiguation=AuthorDisambiguationConfig(),
    topic_model=TopicModelConfig(),
    visualization=VisualizationConfig(),
    curation=CurationConfig(),
    citations=CitationsConfig(),
)

Use the from_dict classmethod to build a config from a plain Python dict — it normalizes stage names and provider choices and rejects unknown keys:

cfg = PipelineConfig.from_dict({
    "search": {"query": 'author:"Hawking, S*"'},
    "translate": {"provider": "nllb"},
    "author_disambiguation": {"enabled": True, "backend": "local", "runtime": "auto"},
})

Every top-level key maps to one of the ten section configs documented in Configuration. For author disambiguation, model_bundle=None uses the packaged default model bundle. Set backend="modal" only when Modal credentials are configured.

run_pipeline

Source: src/ads_bib/pipeline.py:1804

run_pipeline(
    config: PipelineConfig,
    *,
    start_stage: StageName | None = None,
    stop_stage: StageName | None = None,
    project_root: Path | str | None = None,
    run_name: str | None = None,
    paths: dict[str, Path] | None = None,
    run: RunManager | None = None,
    tracker: CostTracker | None = None,
    start_time: float | None = None,
    load_environment: bool = True,
    output_mode: OutputMode = "cli",
) -> PipelineContext

Runs the full pipeline or a stage-bounded slice. When start_stage or stop_stage is None, values from config.run in the YAML/object are used (the same names as in the run table in Configuration). When you pass start_stage / stop_stage here, they override those config values. They use the same stage names as the CLI (search, translate, ..., citations). load_environment=True reads .env from project_root. output_mode="notebook" uses notebook-friendly progress display.

Returns a PipelineContext whose attributes expose the materialized stage outputs: publications, refs, documents, embeddings, reduced_5d, reduced_2d, topic_model, topic_info, topic_df, curated_df, citation_results.
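Putting use case (4) together: load the config a previous run persisted and hand it straight to the runner. A sketch assuming a standard PyYAML loader for config_used.yaml and an existing run directory (the path placeholder is illustrative):

```python
from pathlib import Path

import yaml

from ads_bib import PipelineConfig, run_pipeline

# Reload the exact configuration a previous run wrote to disk.
raw = yaml.safe_load(Path("runs/<run_id>/config_used.yaml").read_text())
cfg = PipelineConfig.from_dict(raw)

# Re-run only the final stage on top of the existing artifacts.
ctx = run_pipeline(
    cfg,
    start_stage="citations",
    project_root=".",
)
print(ctx.citation_results.keys())
```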

NotebookSession

Source: src/ads_bib/notebook.py:55

NotebookSession(
    *,
    project_root: Path | str | None = None,
    run_name: str = "ADS_Curation_Run",
    start_time: float | None = None,
)

The interactive notebook wrapper. It owns one PipelineContext and rebuilds it incrementally: when you update a section dict, the session detects which stages that change invalidates and discards only the affected downstream state.

from ads_bib import NotebookSession

session = NotebookSession(project_root=".")

session.set_section("search", {"query": 'author:"Hawking, S*"'})
session.set_section("topic_model", {"backend": "toponymy"})

session.run_stage("search")
session.run_stage("export")
session.run_stage("translate")
# ... continue stage by stage

set_section(name, values)

Source: src/ads_bib/notebook.py:182

Updates one config section in place and rebuilds the prepared config. Valid section names (from SECTION_NAMES in src/ads_bib/notebook.py:32):

run
search
translate
llama_server
tokenize
author_disambiguation
topic_model
visualization
curation
citations

set_section("run", {...}) intentionally rejects changes to run_name within an existing session — recreate the session to start a new run directory.

Accessors

Every stage output is exposed as a read-only property on the session: publications, refs, documents, embeddings, reduced_5d, reduced_2d, topic_model, topic_info, topic_df, curated_df, citation_results, plus config, run, paths, and tracker.
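Combining the session workflow with the accessors, a typical inspect-then-adjust loop looks like the following sketch (the exact stage boundaries and intermediate shapes depend on your config):

```python
from ads_bib import NotebookSession

session = NotebookSession(project_root=".")
session.set_section("search", {"query": 'author:"Hawking, S*"'})

session.run_stage("search")
print(session.publications.shape)   # inspect raw search results

session.run_stage("export")
session.run_stage("translate")

# Changing a downstream section invalidates only the affected stages;
# search/export/translate state above is kept.
session.set_section("topic_model", {"backend": "bertopic"})
```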

Low-Level Topic-Model Chain

Use these when you want to run topic modeling on your own texts — outside the ADS data flow — or when you want to swap one step without driving the full pipeline. They are all row-aligned: documents[i], embeddings[i], reduced[i], and topics[i] must refer to the same input document.

compute_embeddings

Source: src/ads_bib/topic_model/embeddings.py:171

compute_embeddings(
    documents: list[str],
    *,
    provider: str,           # "local" | "huggingface_api" | "openrouter"
    model: str,
    cache_dir: Path | None = None,
    batch_size: int = 96,
    max_workers: int = 5,
    api_key: str | None = None,
    openrouter_cost_mode: str = "hybrid",
    show_progress: bool = True,
    ...
) -> np.ndarray

Computes or reloads cached document embeddings. Returns an (n_documents, embedding_dim) array. Pass a cache_dir to enable on-disk caching. Cache files include the provider, model, and input fingerprint so different corpora can use the same embedding model without overwriting each other.

from pathlib import Path
from ads_bib import compute_embeddings

embeddings = compute_embeddings(
    documents,
    provider="local",
    model="google/embeddinggemma-300m",
    cache_dir=Path("data/cache/embeddings"),
)

reduce_dimensions

Source: src/ads_bib/topic_model/reduction.py:166

reduce_dimensions(
    embeddings: np.ndarray,
    *,
    method: str = "pacmap",   # "pacmap" | "umap"
    params_5d: dict | None = None,
    params_2d: dict | None = None,
    random_state: int = 42,
    cache_dir: Path | None = None,
    ...
) -> tuple[np.ndarray, np.ndarray]

Returns (reduced_5d, reduced_2d) — the 5D array is the input to clustering, the 2D array is the visualization coordinate space.

from ads_bib import reduce_dimensions

reduced_5d, reduced_2d = reduce_dimensions(
    embeddings,
    method="pacmap",
    params_5d={"n_neighbors": 30, "metric": "angular"},
    params_2d={"n_neighbors": 30, "metric": "angular"},
)

fit_bertopic

Source: src/ads_bib/topic_model/backends.py:1820

fit_bertopic(
    documents: list[str],
    reduced_5d: np.ndarray,
    *,
    llm_provider: str = "local",
    llm_model: str = "google/gemma-3-1b-it",
    clustering_method: str = "fast_hdbscan",
    clustering_params: dict | None = None,
    ...
) -> BERTopic

Fits BERTopic on the 5D vectors and runs the configured labeling path. Returns a fitted BERTopic model. Pass the result to build_topic_dataframe.
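A minimal call, assuming documents and reduced_5d come from the two steps above and the default local labeling model is available (the clustering_params key below is an assumption, mirroring HDBSCAN-style options):

```python
from ads_bib import fit_bertopic

topic_model = fit_bertopic(
    documents,
    reduced_5d,
    llm_provider="local",
    llm_model="google/gemma-3-1b-it",
    clustering_method="fast_hdbscan",
    clustering_params={"min_cluster_size": 15},  # assumed HDBSCAN-style key
)
```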

fit_toponymy

Source: src/ads_bib/topic_model/backends.py:2009

fit_toponymy(
    documents: list[str],
    embeddings: np.ndarray,
    clusterable_vectors: np.ndarray,
    *,
    backend: str = "toponymy",
    layer_index: int | str = "auto",
    llm_provider: str = "openrouter",
    llm_model: str = "google/gemini-3-flash-preview",
    embedding_provider: str = "local",
    embedding_model: str = "google/gemini-embedding-001",
    ...
) -> tuple[Any, np.ndarray, pd.DataFrame]

Fits Toponymy and returns (topic_model, topics, topic_info). The topics array is the working-layer assignment vector (compatibility view for BERTopic-style downstream code); the hierarchy layers live on the model and end up on the topic DataFrame when you pass it through build_topic_dataframe.
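A sketch of the Toponymy path, reusing embeddings and reduced_5d from the earlier steps; provider and model values are left at the defaults shown in the signature:

```python
from ads_bib import fit_toponymy

topic_model, topics, topic_info = fit_toponymy(
    documents,
    embeddings,
    reduced_5d,          # the clusterable vectors
    layer_index="auto",
)
```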

build_topic_dataframe

Source: src/ads_bib/topic_model/output.py:40

build_topic_dataframe(
    df: pd.DataFrame,
    topic_model,                 # fitted BERTopic or Toponymy
    topics: np.ndarray,
    reduced_2d: np.ndarray,
    embeddings: np.ndarray | None = None,
    topic_info: pd.DataFrame | None = None,
    *,
    reduced_5d: np.ndarray | None = None,
) -> pd.DataFrame

Returns a copy of df with topic_id, Name, embedding_2d_x, embedding_2d_y, optional embedding_5d_<n> columns, optional full_embeddings, and — for Toponymy — the topic_layer_<n>_id / topic_layer_<n>_label hierarchy columns plus topic_primary_layer_index and topic_layer_count.
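To assemble the final frame from either backend, pass a per-document DataFrame that is row-aligned with the arrays. A sketch, where the df constructed here is a hypothetical stand-in for your own metadata:

```python
import pandas as pd

from ads_bib import build_topic_dataframe

df = pd.DataFrame({"text": documents})  # hypothetical per-document metadata

topic_df = build_topic_dataframe(
    df,
    topic_model,          # fitted BERTopic or Toponymy
    topics,
    reduced_2d,
    embeddings=embeddings,
    topic_info=topic_info,
)
print(topic_df[["topic_id", "Name"]].head())
```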

process_all_citations

Source: src/ads_bib/citations.py:632

process_all_citations(
    bibcodes: list[str],
    references: list[list[str]],
    publications: pd.DataFrame,
    ref_df: pd.DataFrame,
    all_nodes: pd.DataFrame,
    *,
    metrics: Sequence[str] = ("direct", "co_citation",
                              "bibliographic_coupling", "author_co_citation"),
    min_counts: Mapping[str, int] | None = None,
    authors_filter: list[str] | None = None,
    authors_filter_uids: list[str] | None = None,
    cited_authors_exclude: list[str] | None = None,
    cited_author_uids_exclude: list[str] | None = None,
    output_format: str = "gexf",
    output_dir: Path | str = "data/output",
    author_entities: pd.DataFrame | None = None,
    show_progress: bool = True,
) -> dict[str, pd.DataFrame]

Computes every selected citation metric and writes the exports. publications must have Bibcode, Year, Author, References; all_nodes must have at least an id column plus any metadata you want to persist on the .gexf nodes. Returns a dict mapping each metric name to its exported edge DataFrame.

authors_filter keeps the existing string-based source-publication filtering. authors_filter_uids adds the same inclusion step for disambiguated author_uids. cited_authors_exclude and cited_author_uids_exclude remove matching cited references before network construction. For non-direct metrics, graph exports are aggregated weighted graphs, while the full detail/provenance rows are written to CSV sidecars.
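A minimal two-metric call, assuming bibcodes, references, publications, ref_df, and all_nodes already carry the required columns described above:

```python
from pathlib import Path

from ads_bib import process_all_citations

results = process_all_citations(
    bibcodes,
    references,
    publications,          # needs Bibcode, Year, Author, References
    ref_df,
    all_nodes,             # needs at least an `id` column
    metrics=("direct", "co_citation"),
    output_format="gexf",
    output_dir=Path("data/output"),
)
direct_edges = results["direct"]   # one edge DataFrame per metric
```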

Notebook Companion

The repository also includes pipeline.ipynb as an optional interactive frontend for the same NotebookSession API. It is not shipped in the ads-bib wheel — clone or download the repository if you want to use it. The notebook uses the same config keys documented throughout this site.