Output Artifacts

A completed ads-bib run writes a resolved config, a run summary, tabular publication/reference data, citation-network exports, and an interactive topic map.

Run Layout

runs/run_20260407_120000_ads_bib_openrouter/
├── config_used.yaml
├── run_summary.yaml
├── logs/
│   └── runtime.log
├── data/
│   ├── search/
│   │   └── search_results.json
│   ├── export/
│   │   ├── publications.parquet
│   │   └── references.parquet
│   ├── translated/
│   │   ├── publications.parquet
│   │   └── references.parquet
│   ├── tokenized/
│   │   ├── publications.parquet
│   │   └── references.parquet
│   ├── and/
│   │   ├── publications.parquet
│   │   ├── references.parquet
│   │   └── author_entities.parquet
│   ├── dataset/
│   │   ├── publications.parquet
│   │   ├── references.parquet
│   │   ├── topic_info.parquet
│   │   └── dataset_manifest.json
│   └── citations/
│       ├── direct.gexf
│       ├── co_citation.gexf
│       ├── bibliographic_coupling.gexf
│       ├── author_co_citation.gexf
│       └── download_wos_export.txt
└── plots/
    └── topic_map.html

data/cache/ lives outside the run folder and is shared by later runs and variants. Files inside runs/<run_id>/ are the artifacts for that exact run. The stage directories (search, export, translated, tokenized, and) are run-local restart points; --from-run uses them before consulting any project-wide cache.

The public Parquet bundle is prepared for downstream analysis when it is written: duplicate Bibcode rows are reduced deterministically, publication References lists are normalized to known reference rows, and the optional author_uids column contains unique real-person entities rather than placeholders such as "No author".

config_used.yaml

The resolved, normalized PipelineConfig actually used for the run. You can feed it back into the CLI after setting the needed environment variables:

ads-bib run --config runs/<run_id>/config_used.yaml

Secrets such as ADS tokens, OpenRouter keys, and Hugging Face tokens are written as <redacted>, so keep real credentials in .env or your shell environment. Use the file to audit what values the preset + CLI overrides resolved to. For iteration, prefer ads-bib run --from-run <run_id> --set ...; it loads this file safely, restores redacted secrets from .env/environment values, and reuses any still-valid artifacts.

run_summary.yaml

Compact run report written at the end of each run.

schema_version: 2
artifact_layout_version: 2
run:
  run_id: run_20260407_120000_ads_bib_openrouter
  run_name: ads_bib_openrouter
  started_at_utc: "2026-03-05T12:36:44+00:00"
  ended_at_utc: "2026-03-05T12:52:11+00:00"
  duration_seconds: 927.34
  duration_minutes: 15.46
  status: completed        # or "failed"
  error: null
stages:
  requested_start_stage: search
  requested_stop_stage: null
  completed_stages: [search, export, translate, ...]
  failed_stage: null
reproducibility:
  config_path: runs/.../config_used.yaml
  config_sha256: "abc123..."
  git_commit: "def456..."
  git_dirty: false
counts:
  total_processing:
    publications: 361
    references: 1301
  topic_model:
    documents_modeled: 348
    topics_nunique: 6
    outliers_count: 13
    outliers_rate: 0.0374
  curated:
    publications: 348
topic_hierarchy:             # Toponymy only
  topic_layer_count: 3
  topic_primary_layer_index: 2
  topic_clusters_per_layer: [15, 8, 4]
  topic_primary_layer_selection: auto
variant:                     # only for --from-run variants
  base_run_id: run_20260407_120000_ads_bib_openrouter
  base_run_path: runs/run_20260407_120000_ads_bib_openrouter
  changed_keys:
    - topic_model.embedding_model
  recomputed_from: embeddings
  reused_until: author_disambiguation
costs:
  total_tokens: 125000
  total_cost_usd: 0.0234
  by_step:
    - step: translation
      provider: openrouter
      model: google/gemini-3-flash-preview
      prompt_tokens: 50000
      completion_tokens: 45000
      total_tokens: 95000
      calls: 94
      cost_usd: 0.0156

Key fields:

  • schema_version — bumped on breaking changes to this file.
  • artifact_layout_version — identifies the canonical v0.2 run folder layout used by --from-run variants.
  • stages.completed_stages — usable for resume-style runs with --from <next_stage>.
  • reproducibility.config_sha256 — same value for two runs means they used byte-identical configs.
  • counts.topic_model.outliers_rate — quality proxy; a very high rate usually means the clustering is too strict and many documents are left unassigned.
  • costs — only populated for providers with cost tracking (OpenRouter respects openrouter_cost_mode; HF API calls are not billed through this tracker).
  • variant — present only for --from-run variants. It records the base run, changed keys, first recomputed stage, and last reused stage. This is additive under schema_version: 2.
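
The summary is plain YAML (parse it with e.g. yaml.safe_load). A minimal sketch of resume-stage selection from stages.completed_stages — note that STAGE_ORDER below is a hypothetical ordering for illustration, not the pipeline's canonical stage list:

```python
# Hypothetical stage ordering; check your pipeline's actual stage names.
STAGE_ORDER = ["search", "export", "translate", "tokenize",
               "author_disambiguation", "embeddings", "reduction",
               "topic_model", "citations", "plots"]

def next_stage(summary: dict):
    """Return the first stage not listed in stages.completed_stages."""
    done = set(summary["stages"]["completed_stages"])
    for stage in STAGE_ORDER:
        if stage not in done:
            return stage
    return None  # every stage completed

# A dict as parsed from run_summary.yaml:
summary = {"stages": {"completed_stages": ["search", "export", "translate"]}}
print(next_stage(summary))  # -> tokenize
```

The returned stage name is what you would pass to --from <next_stage>.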

publications.parquet

The curated document-level output. Columns accumulate across stages:

Stage              Columns
Export             Bibcode, Author, Title, Year, Journal, Abstract, Citation Count, DOI, Affiliation, ...
Translation        Title_lang, Title_en, Abstract_lang, Abstract_en
Tokenization       full_text, tokens
AND (optional)     author_uids, author_display_names
Embeddings         (cached separately, not in DataFrame)
Reduction          embedding_5d_0 ... embedding_5d_4, embedding_2d_x, embedding_2d_y
Topic (BERTopic)   topic_id, Name
Topic (Toponymy)   topic_id, Name, topic_layer_<n>_id, topic_layer_<n>_label, topic_primary_layer_index, topic_layer_count

Schema conventions:

  • All pipeline-produced columns use snake_case.
  • topic_id is the document-topic membership column (int). -1 = outlier.
  • Name is the human-readable topic label.
  • embedding_5d_0 ... embedding_5d_4 are the reduced coordinates used for clustering.
  • embedding_2d_x / embedding_2d_y are the 2D coordinates for visualization.
  • Full embedding vectors stay in data/cache/embeddings/*.npz; they are not written into the run parquet files.
  • For Toponymy, topic_id and Name are working-layer aliases. The canonical hierarchy is topic_layer_<n>_id / topic_layer_<n>_label, where layer 0 is the finest and higher layers are coarser.
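
The alias relationship can be expressed directly. A minimal pandas sketch with invented toy values, following the column names above:

```python
import pandas as pd

# Two-layer toy hierarchy; layer 0 is the finest, layer 1 is coarser.
df = pd.DataFrame({
    "topic_layer_0_id": [12, 7],
    "topic_layer_0_label": ["hawking radiation", "quantum cosmology"],
    "topic_layer_1_id": [3, 2],
    "topic_layer_1_label": ["black holes", "cosmology"],
    "topic_primary_layer_index": [1, 1],
    "topic_layer_count": [2, 2],
})

# topic_id / Name alias the primary layer's id / label columns:
primary = df["topic_primary_layer_index"].iloc[0]
df["topic_id"] = df[f"topic_layer_{primary}_id"]
df["Name"] = df[f"topic_layer_{primary}_label"]
```

In a real Toponymy run these alias columns are already written for you; the sketch only shows which canonical columns they mirror.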

A typical row after a completed BERTopic run looks like this (truncated to the most useful columns):

Bibcode          Year  Title_en                                   topic_id  Name                        embedding_5d_0  embedding_2d_x  embedding_2d_y
1974Natur.248..  1974  Black hole explosions?                     2         Hawking radiation           0.14          -3.42           1.88
1975CMaPh..43..  1975  Particle creation by black holes           2         Hawking radiation           0.18          -3.18           2.04
1988PhRvD..37..  1988  Wave function of the Universe              4         Quantum cosmology           -0.31           1.67          -0.92
1996PhRvL..77..  1996  Microscopic origin of the entropy          1         Black hole thermodynamics    0.06          -2.15          -0.41
2005PhRvD..72..  2005  Information loss in black holes            2         Hawking radiation           0.16          -3.01           1.73

Load it back with pandas.read_parquet("runs/<run_id>/data/dataset/publications.parquet"). For Toponymy runs, each row additionally carries topic_layer_0_id, topic_layer_0_label, … up to topic_layer_<n>_* and the two hierarchy metadata columns topic_primary_layer_index and topic_layer_count.
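
Per-topic document counts and the outlier rate are easy to recompute from this table. A sketch with toy rows standing in for the parquet file:

```python
import pandas as pd

# In a real run:
# pubs = pd.read_parquet("runs/<run_id>/data/dataset/publications.parquet")
pubs = pd.DataFrame({
    "Bibcode": ["A", "B", "C", "D"],
    "topic_id": [2, 2, -1, 4],
    "Name": ["Hawking radiation", "Hawking radiation", "Outliers", "Quantum cosmology"],
})

assigned = pubs[pubs["topic_id"] != -1]      # drop outliers (topic_id == -1)
counts = assigned["Name"].value_counts()     # documents per topic label
outlier_rate = (pubs["topic_id"] == -1).mean()  # cf. counts.topic_model.outliers_rate
```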

If author disambiguation ran, author_uids and author_display_names are cleaned entity lists for analysis. They may be shorter than the raw Author list because duplicate person IDs and non-person placeholders are removed. The raw Author list remains unchanged for provenance and display.
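
A sketch of author-level aggregation over the cleaned columns (toy rows; real values come from publications.parquet after a run with AND enabled):

```python
import pandas as pd

pubs = pd.DataFrame({
    "Bibcode": ["X", "Y"],
    "author_uids": [["uid_1", "uid_2"], ["uid_1"]],
    "author_display_names": [["Hawking, S.", "Penrose, R."], ["Hawking, S."]],
})

# One row per (publication, author) mention, then papers per entity.
mentions = pubs.explode(["author_uids", "author_display_names"])
papers_per_author = mentions.groupby("author_uids")["Bibcode"].nunique()
```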

references.parquet

The normalized cited-reference table. It uses the same front-loaded metadata ordering as publications.parquet where columns overlap: Bibcode, Year, Author, Title, translated title/abstract columns, journal metadata, DOI, and optional author-disambiguation columns.

Every ID retained in publications.References is present in references.Bibcode. Missing reference rows from the ADS export are pruned from the final reference lists and recorded in dataset_manifest.json.
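
This invariant is easy to check after loading the bundle. A sketch with toy frames in place of the two parquet files:

```python
import pandas as pd

# In a real run, load both tables from runs/<run_id>/data/dataset/.
pubs = pd.DataFrame({"Bibcode": ["A", "B"],
                     "References": [["R1", "R2"], ["R1"]]})
refs = pd.DataFrame({"Bibcode": ["R1", "R2"]})

cited = set().union(*pubs["References"])     # every retained reference ID
dangling = cited - set(refs["Bibcode"])      # should be empty by construction
assert not dangling
```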

dataset_manifest.json

Manifest for the final bundle. It records artifact hashes, row counts, topic coordinate columns, author-disambiguation availability, and a cleaning block with the number of duplicate keys, dangling reference mentions, placeholder author UIDs, and duplicate author UID mentions removed during bundle export.

topic_info.parquet

The topic-level table. It has one row per topic, not one row per publication. Typical columns are Topic, Count, Name, and representation fields such as Main, MMR, POS, KeyBERT, Representation, and Representative_Docs when the backend provides them.

Use publications.parquet when you need document-level assignments and coordinates. Use topic_info.parquet when you need topic labels, counts, or representative terms/documents without repeating them for every publication.
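
To combine the two, join on the topic key: topic_info keys topics by Topic, publications by topic_id. A pandas sketch with toy frames:

```python
import pandas as pd

pubs = pd.DataFrame({"Bibcode": ["A", "B", "C"], "topic_id": [2, 4, -1]})
topics = pd.DataFrame({
    "Topic": [-1, 2, 4],
    "Count": [13, 120, 80],
    "Name": ["Outliers", "Hawking radiation", "Quantum cosmology"],
})

# Left join keeps every document, including outliers (Topic == -1).
merged = pubs.merge(topics, left_on="topic_id", right_on="Topic", how="left")
```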

.gexf Node Attributes

Every publication node in the exported .gexf files carries:

Bibcode, Author, Title, Year, Journal, Abstract, Citation Count, DOI, topic_id, Name, embedding_5d_0 ... embedding_5d_4, embedding_2d_x, embedding_2d_y, Title_en, Abstract_en.

For Toponymy runs, nodes additionally carry topic_layer_<n>_id, topic_layer_<n>_label, topic_primary_layer_index, and topic_layer_count.

The four network files (direct, co_citation, bibliographic_coupling, author_co_citation) share the same node schema and differ only in edge semantics. See Citation Networks for the interpretation of each.
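
The node attributes round-trip through any GEXF-capable library. A sketch using networkx (an assumption, not a pipeline dependency) with a toy graph; in a real run you would call nx.read_gexf("runs/<run_id>/data/citations/co_citation.gexf") directly:

```python
import os
import tempfile

import networkx as nx

# Toy graph with a few of the node attributes listed above.
g = nx.Graph()
g.add_node("pub_A", Name="Hawking radiation", topic_id=2, Year=1974)

path = os.path.join(tempfile.mkdtemp(), "co_citation.gexf")
nx.write_gexf(g, path)

loaded = nx.read_gexf(path)
attrs = loaded.nodes["pub_A"]  # attribute dict for one publication node
```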

author_co_citation.gexf from author:"Hawking, S*", opened in Gephi Lite.

download_wos_export.txt

A WOS-format plain-text export of the curated dataset. Use it for CiteSpace and VOSviewer, which both import WOS-style records natively.

topic_map.html

The interactive datamapplot visualization. A self-contained HTML file — open it directly in any modern browser. Controls: hover for metadata, Shift+drag to lasso a word-cloud region, Shift+drag on the timeline to filter years, click a topic entry to isolate it.

topic_map.html from author:"Hawking, S*" in datamapplot.

Author disambiguation (AND)

Author Name Disambiguation (AND) remains an optional external integration. ads-bib owns only the source-level adapter, which:

  • stages ADS-shaped publications and references as source files,
  • calls an external source-based disambiguation function,
  • validates the source-mirrored outputs and maps them back into pipeline DataFrames,
  • caches disambiguated stage snapshots for resume,
  • passes disambiguated author IDs into author-based citation exports.

Expected source inputs:

  • Bibcode
  • Author
  • Year
  • Title_en or Title
  • Abstract_en or Abstract
  • optional Affiliation

Expected source-mirrored output additions:

  • AuthorUID
  • AuthorDisplayName

Mapped pipeline outputs normalize these into:

  • author_uids
  • author_display_names

When AND is enabled and ads-and produces them, diagnostic outputs are also mirrored under runs/<run_id>/data/and/:

  • source_author_assignments.parquet
  • author_entities.parquet
  • mention_clusters.parquet
  • summary.json
  • 05_stage_metrics_infer_sources.json
  • 05_go_no_go_infer_sources.json