Output Artifacts¶
A completed ads-bib run writes a resolved config, a run summary, tabular
publication/reference data, citation-network exports, and an interactive topic
map.
Run Layout¶
```text
runs/run_20260407_120000_ads_bib_openrouter/
├── config_used.yaml
├── run_summary.yaml
├── logs/
│   └── runtime.log
├── data/
│   ├── search/
│   │   └── search_results.json
│   ├── export/
│   │   ├── publications.parquet
│   │   └── references.parquet
│   ├── translated/
│   │   ├── publications.parquet
│   │   └── references.parquet
│   ├── tokenized/
│   │   ├── publications.parquet
│   │   └── references.parquet
│   ├── and/
│   │   ├── publications.parquet
│   │   ├── references.parquet
│   │   └── author_entities.parquet
│   ├── dataset/
│   │   ├── publications.parquet
│   │   ├── references.parquet
│   │   ├── topic_info.parquet
│   │   └── dataset_manifest.json
│   └── citations/
│       ├── direct.gexf
│       ├── co_citation.gexf
│       ├── bibliographic_coupling.gexf
│       ├── author_co_citation.gexf
│       └── download_wos_export.txt
└── plots/
    └── topic_map.html
```
`data/cache/` lives outside the run folder and is shared by later runs and
variants. Files inside `runs/<run_id>/` are the artifacts for that exact run.
The stage directories (`search`, `export`, `translated`, `tokenized`, `and`)
are run-local restart points; `--from-run` uses them before consulting any
project-wide cache.
The public Parquet bundle is cleaned for downstream analysis at write time:
duplicate `Bibcode` rows are reduced deterministically, publication
`References` lists are normalized to known reference rows, and the optional
`author_uids` contain unique real-person entities rather than placeholders
such as "No author".
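The deduplication step can be mimicked on an exported frame. A minimal pandas sketch with toy data; the keep-highest-`Citation Count` policy is an assumption for illustration, not the pipeline's documented rule:

```python
import pandas as pd

# Toy publications frame with a duplicate Bibcode. Cleaning policy assumed
# here: keep the row with the highest Citation Count per Bibcode.
pubs = pd.DataFrame({
    "Bibcode": ["1974Natur.248..", "1974Natur.248..", "1975CMaPh..43.."],
    "Citation Count": [4200, 4100, 3900],
})

deduped = (
    pubs.sort_values(["Bibcode", "Citation Count"], ascending=[True, False])
        .drop_duplicates("Bibcode", keep="first")
        .reset_index(drop=True)
)
```

Sorting before `drop_duplicates` is what makes the reduction deterministic: the surviving row no longer depends on input order.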
config_used.yaml¶
The resolved, normalized PipelineConfig actually used for the run. You can
feed it back into the CLI after setting the needed environment variables.
Secrets such as ADS tokens, OpenRouter keys, and Hugging Face tokens are written
as `<redacted>`, so keep real credentials in `.env` or your shell environment.
Use the file to audit what values the preset + CLI overrides resolved to. For
iteration, prefer `ads-bib run --from-run <run_id> --set ...`; it loads this
file safely, restores redacted secrets from `.env`/environment values, and
reuses any still-valid artifacts.
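If you do load the file manually, restore redacted secrets before reuse. A minimal sketch; the `search.ads_token` key and `ADS_TOKEN` variable are hypothetical, so check your own config_used.yaml for the real key paths:

```python
import os

import yaml

# Hypothetical fragment of a config_used.yaml with a redacted secret.
raw = """
search:
  ads_token: "<redacted>"
topic_model:
  embedding_model: all-MiniLM-L6-v2
"""
cfg = yaml.safe_load(raw)

# Secrets are written as <redacted>; restore them from the environment.
if cfg["search"]["ads_token"] == "<redacted>":
    cfg["search"]["ads_token"] = os.environ.get("ADS_TOKEN", "")
```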
run_summary.yaml¶
Compact run report written at the end of each run.
```yaml
schema_version: 2
artifact_layout_version: 2
run:
  run_id: run_20260407_120000_ads_bib_openrouter
  run_name: ads_bib_openrouter
  started_at_utc: "2026-03-05T12:36:44+00:00"
  ended_at_utc: "2026-03-05T12:52:11+00:00"
  duration_seconds: 927.34
  duration_minutes: 15.46
  status: completed  # or "failed"
  error: null
stages:
  requested_start_stage: search
  requested_stop_stage: null
  completed_stages: [search, export, translate, ...]
  failed_stage: null
reproducibility:
  config_path: runs/.../config_used.yaml
  config_sha256: "abc123..."
  git_commit: "def456..."
  git_dirty: false
counts:
  total_processing:
    publications: 361
    references: 1301
  topic_model:
    documents_modeled: 348
    topics_nunique: 6
    outliers_count: 13
    outliers_rate: 0.0374
  curated:
    publications: 348
topic_hierarchy:  # Toponymy only
  topic_layer_count: 3
  topic_primary_layer_index: 2
  topic_clusters_per_layer: [15, 8, 4]
  topic_primary_layer_selection: auto
variant:  # only for --from-run variants
  base_run_id: run_20260407_120000_ads_bib_openrouter
  base_run_path: runs/run_20260407_120000_ads_bib_openrouter
  changed_keys:
    - topic_model.embedding_model
  recomputed_from: embeddings
  reused_until: author_disambiguation
costs:
  total_tokens: 125000
  total_cost_usd: 0.0234
  by_step:
    - step: translation
      provider: openrouter
      model: google/gemini-3-flash-preview
      prompt_tokens: 50000
      completion_tokens: 45000
      total_tokens: 95000
      calls: 94
      cost_usd: 0.0156
```
Key fields:

- `schema_version` — bumped on breaking changes to this file.
- `artifact_layout_version` — identifies the canonical v0.2 run folder layout used by `--from-run` variants.
- `stages.completed_stages` — usable for resume-style runs with `--from <next_stage>`.
- `reproducibility.config_sha256` — same value for two runs means they used byte-identical configs.
- `counts.topic_model.outliers_rate` — quality proxy; very high rates usually mean clusters are too sharp.
- `costs` — only populated for providers with cost tracking (OpenRouter respects `openrouter_cost_mode`; HF API calls are not billed through this tracker).
- `variant` — present only for `--from-run` variants. It records the base run, changed keys, first recomputed stage, and last reused stage. This is additive under `schema_version: 2`.
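The outlier fields are related by simple arithmetic, which makes the summary easy to sanity-check. Re-deriving the rate from the example counts above:

```python
# Values from the counts.topic_model block in the example summary.
documents_modeled = 348
outliers_count = 13

outliers_rate = round(outliers_count / documents_modeled, 4)
# outliers_rate == 0.0374, matching the summary
```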
publications.parquet¶
The curated document-level output. Columns accumulate across stages:
| Stage | Columns |
|---|---|
| Export | Bibcode, Author, Title, Year, Journal, Abstract, Citation Count, DOI, Affiliation, ... |
| Translation | Title_lang, Title_en, Abstract_lang, Abstract_en |
| Tokenization | full_text, tokens |
| AND (optional) | author_uids, author_display_names |
| Embeddings | (cached separately, not in DataFrame) |
| Reduction | embedding_5d_0 ... embedding_5d_4, embedding_2d_x, embedding_2d_y |
| Topic (BERTopic) | topic_id, Name |
| Topic (Toponymy) | topic_id, Name, topic_layer_<n>_id, topic_layer_<n>_label, topic_primary_layer_index, topic_layer_count |
Schema conventions:

- All pipeline-produced columns use snake_case.
- `topic_id` is the document-topic membership column (int). `-1` = outlier.
- `Name` is the human-readable topic label.
- `embedding_5d_0` ... `embedding_5d_4` are the reduced coordinates used for clustering.
- `embedding_2d_x` / `embedding_2d_y` are the 2D coordinates for visualization.
- Full embedding vectors stay in `data/cache/embeddings/*.npz`; they are not written into the run parquet files.
- For Toponymy, `topic_id` and `Name` are working-layer aliases. The canonical hierarchy is `topic_layer_<n>_id` / `topic_layer_<n>_label`, where layer 0 is the finest and higher layers are coarser.
A typical row after a completed BERTopic run looks like this (truncated to the most useful columns):
```text
Bibcode        Year  Title_en                          topic_id  Name                       embedding_5d_0  embedding_2d_x  embedding_2d_y
1974Natur.248..  1974  Black hole explosions?                 2  Hawking radiation                    0.14           -3.42            1.88
1975CMaPh..43..  1975  Particle creation by black holes       2  Hawking radiation                    0.18           -3.18            2.04
1988PhRvD..37..  1988  Wave function of the Universe          4  Quantum cosmology                   -0.31            1.67           -0.92
1996PhRvL..77..  1996  Microscopic origin of the entropy      1  Black hole thermodynamics            0.06           -2.15           -0.41
2005PhRvD..72..  2005  Information loss in black holes        2  Hawking radiation                    0.16           -3.01            1.73
```
Load it back with `pandas.read_parquet("runs/<run_id>/data/dataset/publications.parquet")`.
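A common first step after loading is to drop outliers and keep the plotting coordinates. A minimal sketch using the documented columns, with a toy frame standing in for the real parquet:

```python
import pandas as pd

# Toy stand-in for publications.parquet, using the documented columns.
df = pd.DataFrame({
    "Bibcode": ["1974Natur.248..", "1988PhRvD..37..", "2001Unkn...00.."],
    "topic_id": [2, 4, -1],  # -1 marks outliers
    "Name": ["Hawking radiation", "Quantum cosmology", "Outliers"],
    "embedding_2d_x": [-3.42, 1.67, 0.0],
    "embedding_2d_y": [1.88, -0.92, 0.0],
})

# Keep only documents assigned to a real topic, plus the 2D map coordinates.
assigned = df[df["topic_id"] != -1]
coords = assigned[["Bibcode", "Name", "embedding_2d_x", "embedding_2d_y"]]
```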
For Toponymy runs, each row additionally carries topic_layer_0_id,
topic_layer_0_label, … up to topic_layer_<n>_* and the two hierarchy
metadata columns topic_primary_layer_index and topic_layer_count.
If author disambiguation ran, author_uids and author_display_names are
cleaned entity lists for analysis. They may be shorter than the raw Author
list because duplicate person IDs and non-person placeholders are removed.
The raw Author list remains unchanged for provenance and display.
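The cleaning rule can be illustrated in isolation. A plain-Python sketch of the described behavior (not the actual implementation; the placeholder set and UID format are made up):

```python
def clean_author_uids(uids, placeholders=("No author",)):
    """Drop non-person placeholders and duplicate IDs, keeping first-seen order."""
    seen = set()
    cleaned = []
    for uid in uids:
        if uid in placeholders or uid in seen:
            continue
        seen.add(uid)
        cleaned.append(uid)
    return cleaned

clean_author_uids(["uid_001", "No author", "uid_002", "uid_001"])
# → ["uid_001", "uid_002"]
```

This is why `author_uids` can be shorter than the raw `Author` list while the latter stays untouched.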
references.parquet¶
The normalized cited-reference table. It uses the same front-loaded metadata
ordering as publications.parquet where columns overlap: Bibcode, Year,
Author, Title, translated title/abstract columns, journal metadata, DOI,
and optional author-disambiguation columns.
Every ID retained in `publications.References` is present in
`references.Bibcode`. Reference mentions whose rows are missing from the ADS
export are pruned from the final reference lists, and the prunes are recorded
in `dataset_manifest.json`.
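That invariant is straightforward to verify on a loaded bundle. A sketch with pandas on toy frames in place of the real parquet files:

```python
import pandas as pd

# Toy curated frames mirroring the documented invariant.
pubs = pd.DataFrame({
    "Bibcode": ["A", "B"],
    "References": [["X"], ["X", "Y"]],
})
refs = pd.DataFrame({"Bibcode": ["X", "Y"]})

# Every retained reference mention must resolve to a reference row.
mentions = pubs["References"].explode().dropna()
all_resolved = mentions.isin(set(refs["Bibcode"])).all()
```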
dataset_manifest.json¶
Manifest for the final bundle. It records artifact hashes, row counts, topic
coordinate columns, author-disambiguation availability, and a cleaning block
with the number of duplicate keys, dangling reference mentions, placeholder
author UIDs, and duplicate author UID mentions removed during bundle export.
topic_info.parquet¶
The topic-level table. It has one row per topic, not one row per publication.
Typical columns are Topic, Count, Name, and representation fields such as
Main, MMR, POS, KeyBERT, Representation, and Representative_Docs
when the backend provides them.
Use publications.parquet when you need document-level assignments and
coordinates. Use topic_info.parquet when you need topic labels, counts, or
representative terms/documents without repeating them for every publication.
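The two tables join on the topic id; note that topic_info uses `Topic` where publications uses `topic_id`. A minimal sketch with toy frames:

```python
import pandas as pd

# Toy stand-ins for publications.parquet and topic_info.parquet.
pubs = pd.DataFrame({"Bibcode": ["A", "B", "C"], "topic_id": [2, 4, 2]})
topics = pd.DataFrame({
    "Topic": [2, 4],
    "Name": ["Hawking radiation", "Quantum cosmology"],
    "Count": [2, 1],
})

# Attach topic labels to each publication without storing them per row.
labeled = pubs.merge(topics, left_on="topic_id", right_on="Topic", how="left")
```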
.gexf Node Attributes¶
Every publication node in the exported .gexf files carries:
Bibcode, Author, Title, Year, Journal, Abstract,
Citation Count, DOI, topic_id, Name, embedding_5d_0 ...
embedding_5d_4, embedding_2d_x, embedding_2d_y, Title_en,
Abstract_en.
For Toponymy runs, nodes additionally carry topic_layer_<n>_id,
topic_layer_<n>_label, topic_primary_layer_index, and
topic_layer_count.
The four network files (direct, co_citation, bibliographic_coupling,
author_co_citation) share the same node schema and differ only in edge
semantics. See Citation Networks for the
interpretation of each.
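The exports load directly with networkx (assuming it is installed); for a real run, point `nx.read_gexf` at `runs/<run_id>/data/citations/<name>.gexf`. The sketch below round-trips a tiny graph with two of the documented node attributes instead of reading a run file:

```python
import networkx as nx

# Tiny graph carrying a subset of the documented node attributes.
G = nx.Graph()
G.add_node("1974Natur.248..", topic_id=2, Name="Hawking radiation")
G.add_node("1975CMaPh..43..", topic_id=2, Name="Hawking radiation")
G.add_edge("1974Natur.248..", "1975CMaPh..43..", weight=3)

# Round-trip through GEXF, as the exported files would be read back.
nx.write_gexf(G, "co_citation_demo.gexf")
H = nx.read_gexf("co_citation_demo.gexf")
```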
download_wos_export.txt¶
A Web of Science (WOS) formatted plain-text export of the curated dataset. Use it for CiteSpace and VOSviewer, which both import WOS-style records natively.
topic_map.html¶
The interactive datamapplot visualization. A self-contained HTML file — open it directly in any modern browser. Controls: hover for metadata, Shift+drag to lasso a word-cloud region, Shift+drag on the timeline to filter years, click a topic entry to isolate it.
Author disambiguation (AND)¶
Author Name Disambiguation stays an optional external integration.
ads-bib owns only the source-level adapter, which:

- stages ADS-shaped `publications` and `references` as source files,
- calls an external source-based disambiguation function,
- validates the source-mirrored outputs and maps them back into pipeline DataFrames,
- caches disambiguated stage snapshots for resume,
- passes disambiguated author IDs into author-based citation exports.

Expected source inputs:

- `Bibcode`
- `Author`
- `Year`
- `Title_en` or `Title`
- `Abstract_en` or `Abstract`
- optional `Affiliation`

Expected source-mirrored output additions:

- `AuthorUID`
- `AuthorDisplayName`

Mapped pipeline outputs normalize these into:

- `author_uids`
- `author_display_names`

When AND is enabled, diagnostic outputs are also mirrored under
`runs/<run_id>/data/and/` when ads-and produces them:

- `source_author_assignments.parquet`
- `author_entities.parquet`
- `mention_clusters.parquet`
- `summary.json`
- `05_stage_metrics_infer_sources.json`
- `05_go_no_go_infer_sources.json`
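The source-mirrored-to-pipeline mapping amounts to a column rename into snake_case. A hedged pandas sketch with a hypothetical one-row output; the UID and display-name values are made up:

```python
import pandas as pd

# Hypothetical source-mirrored output row, using the columns listed above.
src = pd.DataFrame({
    "Bibcode": ["1974Natur.248.."],
    "AuthorUID": [["hawking_s_1"]],
    "AuthorDisplayName": [["Hawking, S. W."]],
})

# Normalize the source-mirrored names into the pipeline's snake_case columns.
mapped = src.rename(columns={
    "AuthorUID": "author_uids",
    "AuthorDisplayName": "author_display_names",
})
```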
Read next¶
- Citation Networks — how to read each graph type
- Troubleshooting — if exports are missing or empty
- Configuration — tuning `citations.*` keys