Data — Solstice Pro AI Ltd

RA 19h 18m DEC +38° 47′ · Field 01 — Posture

What we mean by "data" at an observatory

An observatory does not own its sky. The light has already been arriving for several billion years, and any organisation that pretends to monopolise its analysis has misread the room. Solstice Pro AI works with five principal archives (three of them public, two of them curated by us) and a small library of internal calibration sequences taken on test-bench sensors. We do not gate any of the five archives behind a login wall; where we cannot redistribute (NASA, GOES, ZTF) we point at the canonical mirror and ship the loaders.

What we do claim is the validation harness. Each model we release is paired with a complete instrument-night replay: a chronologically ordered sequence of frames, masks and human-adjudicated labels that the model has never been allowed to touch during training. The replay is the artefact a buyer should care about. The trained weights are the by-product.

Below is the working register, kept in the same observatory-log style we use for nightly operations. Sizes are uncompressed working-set sizes; on-disk footprint after lz4 is roughly 0.55× for image cubes and 0.72× for FITS tables.
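The compression arithmetic can be sketched in a few lines. This is a minimal illustration using only the two ratios quoted above; the function name, the dictionary keys, and the assumption that the AIA working set is stored as image cubes are ours for the example, not part of any Solstice loader.

```python
# lz4 on-disk / working-set ratios quoted in the register text.
LZ4_RATIO = {"image_cube": 0.55, "fits_table": 0.72}


def on_disk_tb(working_set_tb: float, kind: str) -> float:
    """Estimate the lz4 on-disk footprint (TB) from a working-set size (TB)."""
    if kind not in LZ4_RATIO:
        raise ValueError(f"unknown data kind: {kind!r}")
    return working_set_tb * LZ4_RATIO[kind]


# Assumption for illustration: the 18.4 TB AIA working set is all image cubes.
aia_on_disk = on_disk_tb(18.4, "image_cube")
print(f"{aia_on_disk:.2f} TB")  # prints "10.12 TB"
```

The ratios are rough, so treat the result as a provisioning estimate rather than a measured footprint.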

RA 19h 36m DEC +39° 12′ · Field 02 — Register

The five archives

Archive	Size	License	Last updated	Citation pattern
AIA EUV solar archive · SDO/AIA · 94 / 131 / 171 / 193 / 211 / 304 / 335 Å · 2010-05 → present	18.4 TB working set; full archive ≈ 1.4 PB at JSOC	NASA public domain · use without restriction; attribution requested	2026-05-09 (cadence: 12 s native, we cache 60 s)	Lemen et al. 2012 · SoPh 275 · doi:10.1007/s11207-011-9776-8
HMI magnetograms · SDO/HMI line-of-sight + vector · 2010-05 → present	9.1 TB working set; full archive ≈ 0.6 PB	NASA public domain · JSOC distribution	2026-05-09 (cadence: 720 s native, we keep all of it)	Scherrer et al. 2012 · SoPh 275 · doi:10.1007/s11207-011-9834-2
ZTF public alert stream replay · 2018-06 → 2024-12 · real-bogus, light curves, host stamps	11.7 TB packed AVRO; ≈ 8.3 billion alerts post-filter	ZTF public stream · DR-20+ terms; commercial use permitted with attribution	2026-04-22 (frozen replay set v3.1)	Bellm et al. 2019 · PASP 131:018002 · doi:10.1088/1538-3873/aaecbe
EMCCD twilight calibration set · in-house bench · Andor iXon 888, gains g = 1, 50, 100, 200	1.8 TB · 12 nights · 41 200 frames	CC-BY-4.0 · published 2025-12-04 with the Photon-001 release	2026-02-18 (added 4 nights of clouded-out diffuse-only frames)	Solstice Pro AI 2025 · dispatch-post-photon-limited-vision
Cosmic-ray segmentation corpus · 8 400 frames · hand-labelled · narrow-field 60 s exposures	0.6 TB · pixel-accurate masks · 14 instruments represented	CC-BY-SA-4.0 · published 2026-03-08; mirrored on Zenodo	2026-03-08 (v1.0 frozen; v1.1 in adjudication, expected Q3 2026)	Solstice Pro AI 2026 · Zenodo doi:10.5281/zenodo.7XXXXXX

Two of the five are ours to give away, and we do. The cosmic-ray corpus in particular took a junior astronomer and a senior one the better part of four months to label, double-adjudicate and triple-spot-check. We did not consider keeping it private; the community gains more from a published benchmark than we lose from a moat.

RA 19h 54m DEC +39° 41′ · Field 03 — Reproducibility

Deterministic builds and the validation notebook

An astronomical paper that cannot be reproduced is gossip. An astronomical model that cannot be reproduced is worse, because it gives the appearance of operational reliability without the substance. Solstice Pro AI commits, at the level of company policy, to the following reproducibility constraints on every model release.

Deterministic CUDA builds

Every Solstice container is pinned to a specific CUDA toolkit (currently 12.4), a specific cuDNN minor version and a specific PyTorch wheel, and runs with torch.use_deterministic_algorithms(True) as a guard rail. The non-deterministic kernels we cannot avoid (scatter-add and some atomic reductions) are flagged in the release notes with the expected pixel-level variance across runs. For everything else, a buyer who reruns inference on the same input on the same GPU class will see byte-identical output.
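A buyer who wants to check the byte-identical claim on their own hardware could start from something like the sketch below. It is not the Solstice container entry point: the function names are hypothetical, and the PyTorch import is guarded so the digest helper still runs where PyTorch is absent. The calls themselves (torch.manual_seed, torch.use_deterministic_algorithms, torch.backends.cudnn.benchmark, and the CUBLAS_WORKSPACE_CONFIG variable) are the ones PyTorch documents for reproducibility.

```python
import hashlib
import os


def configure_determinism(seed: int = 0) -> dict:
    """Apply the standard PyTorch determinism switches, if PyTorch is present."""
    # cuBLAS needs a fixed workspace size for deterministic behaviour;
    # it must be set before the first CUDA call. ":4096:8" is one of the
    # two values the PyTorch reproducibility notes suggest.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    applied = {"torch": False}
    try:
        import torch
    except ImportError:
        return applied  # nothing more to configure without PyTorch
    torch.manual_seed(seed)
    # Ops without a deterministic implementation will now raise an error.
    torch.use_deterministic_algorithms(True)
    # Disable cuDNN autotuning, which can pick different kernels per run.
    torch.backends.cudnn.benchmark = False
    applied["torch"] = True
    return applied


def output_digest(raw: bytes) -> str:
    """SHA-256 of a raw output buffer; equal digests mean byte-identical output."""
    return hashlib.sha256(raw).hexdigest()
```

Two inference runs on the same input can then be compared by hashing each output buffer (for a NumPy array, `array.tobytes()`) and checking that the digests match.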

Validation notebook packaged with every release

The notebook is not a marketing artefact. It loads the held-out replay set, runs inference, computes the headline metric, and produces every plot in the release announcement. If we change a number on a dispatch post, we have changed the notebook, and the diff is in the git history. The notebook is the contract.

Frozen replay sets, not random splits

A random 80/20 split is the default sin of machine-learning practice and the one that an instrument night punishes hardest. Adjacent frames in a sky survey are correlated: same airmass, same seeing, same satellites passing overhead. Under a random split, near-duplicates of the test frames end up in the training set, so the score overstates real-night performance by a half-step to a full step. We therefore evaluate on whole instrument nights (a complete observing session, sunset to sunrise) that the model has not been allowed to see in any form during training.
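The difference between a random split and a whole-night hold-out is easy to show in code. The sketch below is an illustration only: the Frame record, its field names and the night identifiers are hypothetical, not the schema of any Solstice replay set.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class Frame(NamedTuple):
    frame_id: str
    night: str        # hypothetical night identifier, e.g. "2024-03-14"
    timestamp: float  # any monotonic time value; only the ordering matters


def split_by_night(
    frames: Iterable[Frame], replay_nights: Set[str]
) -> Tuple[List[Frame], List[Frame]]:
    """Send every frame of a held-out night to the replay set, never a subset."""
    train, replay = [], []
    for frame in frames:
        (replay if frame.night in replay_nights else train).append(frame)
    # The replay is consumed in chronological order, like a real night.
    replay.sort(key=lambda f: f.timestamp)
    return train, replay


def shared_nights(train: Iterable[Frame], replay: Iterable[Frame]) -> Set[str]:
    """Nights that appear on both sides of a split; should be empty."""
    return {f.night for f in train} & {f.night for f in replay}


frames = [
    Frame("a2", "night-A", 2.0),
    Frame("b1", "night-B", 11.0),
    Frame("a1", "night-A", 1.0),
    Frame("b2", "night-B", 12.0),
]
train, replay = split_by_night(frames, {"night-A"})
print([f.frame_id for f in replay])  # prints ['a1', 'a2']
print(shared_nights(train, replay))  # prints set()
```

The `shared_nights` check is worth running on any split produced by other tooling: a random 80/20 split of the same four frames would almost always report at least one shared night.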

RA 20h 13m DEC +40° 02′ · Field 04 — Provenance

Where each archive sits in the pipeline

Not all of these archives are training data. Some are validation only, some are calibration only, and the SDO archives and the ZTF replay serve as both training and validation data under a strict temporal partition. The entries below make each role explicit; readers from observatory data-management offices should find this the most useful section of the page.

AIA · SDO

Helios training + validation

Training partition: 2010-05 → 2022-12 (active-region samples balanced by GOES class). Validation partition: 2023-01 → 2024-06 frozen for the Helios-002 release. 2024-07 → present held back for Helios-003.
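A temporal partition of this kind reduces to a date lookup. The sketch below uses the month boundaries stated above; the exact first and last days of each range, and the function and label names, are assumptions made for the example.

```python
from datetime import date

# Month-level boundaries come from the text; day-level edges are assumed.
TRAIN = (date(2010, 5, 1), date(2022, 12, 31))
VALIDATION = (date(2023, 1, 1), date(2024, 6, 30))
HELD_BACK_FROM = date(2024, 7, 1)


def partition_for(obs_date: date) -> str:
    """Map an observation date to its partition label."""
    if TRAIN[0] <= obs_date <= TRAIN[1]:
        return "train"
    if VALIDATION[0] <= obs_date <= VALIDATION[1]:
        return "validation"
    if obs_date >= HELD_BACK_FROM:
        return "held_back"
    raise ValueError(f"{obs_date} predates the archive")


print(partition_for(date(2022, 12, 31)))  # prints "train"
print(partition_for(date(2023, 1, 1)))    # prints "validation"
print(partition_for(date(2025, 2, 14)))   # prints "held_back"
```

Keying the partition on observation date alone means a frame can never change sides when the archive is re-indexed or re-sampled.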

HMI · SDO

Helios magnetogram branch

Identical temporal partition to AIA. We do not stack frames across the partition boundary; magnetogram evolution is the signal the model is supposed to learn.
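One way to enforce the no-stacking rule is to reject any stacking window whose first and last frames fall on different sides of a boundary. This is a sketch under assumptions: the boundary instants are taken as midnight on the first day of the months named in the AIA entry, and the function names are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime

# Assumed boundary instants: start of validation, start of held-back data.
BOUNDARIES = [datetime(2023, 1, 1), datetime(2024, 7, 1)]


def partition_index(t: datetime) -> int:
    """0 = training, 1 = validation, 2 = held back."""
    # A timestamp exactly on a boundary belongs to the later partition.
    return bisect_right(BOUNDARIES, t)


def can_stack(first: datetime, last: datetime) -> bool:
    """True only if the whole window sits inside a single partition."""
    if last < first:
        raise ValueError("window ends before it starts")
    return partition_index(first) == partition_index(last)


# A 20-minute window straddling New Year 2023 must not be stacked.
print(can_stack(datetime(2022, 12, 31, 23, 50), datetime(2023, 1, 1, 0, 10)))  # False
print(can_stack(datetime(2023, 3, 1, 0, 0), datetime(2023, 3, 1, 0, 12)))      # True
```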

ZTF · alert replay

Sky training + frozen replay

Training: 2018-06 → 2023-12. Frozen replay: 2024-01 → 2024-12, used as the headline benchmark for Sky-001. We do not retrain against the replay; once it is held out, it stays out.

EMCCD twilight

Photon calibration only

Used to characterise sensor noise and to generate physically realistic synthetic frames for rare-event augmentation. Never used as validation; calibration is a different epistemic category to performance.

CR segmentation

Photon validation + public benchmark

Held out as the published benchmark on which the LACosmic baseline is reported. v1.0 is frozen and is the only version whose numbers we quote in dispatches; v1.1 will require a new dispatch and a new headline.

In-house dark frames

Photon training only

≈ 2 700 dark and bias sequences across 6 sensor classes, used to seed the self-supervised denoiser's noise model. Not released; characterises specific test-bench hardware and would not generalise.

RA 20h 31m DEC +40° 28′ · Field 05 — Citation & reuse

How to cite us, and how we cite you

If you use the Solstice cosmic-ray segmentation corpus, cite the Zenodo DOI in the table above and, if convenient, the dispatch post that announced it. If you use the EMCCD twilight calibration set, cite the Photon-001 release notes and acknowledge the bench (Andor iXon 888, sensor serial pool S-44/S-45/S-46). If you use a trained Solstice model under a research-collaborator agreement, cite the model SHA and the dispatch post that published the headline metric; the SHA is printed on the splash screen of every container.

In the other direction, we cite every archive we depend on in the standard astronomical-journal pattern: lead author, year, journal, DOI, on the first reference and an abbreviation thereafter. The bibliography for any given model release is included in the validation notebook and re-rendered every time the notebook is rerun, so it cannot drift from what the model actually consumes.
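The first-reference-then-abbreviation pattern can be sketched as a small bibliography helper. The class names and the abbreviation "L12" are hypothetical, chosen for illustration; the Lemen et al. entry itself is the AIA citation from the register.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    lead_author: str
    year: int
    journal: str
    doi: str
    abbreviation: str  # hypothetical short form used after the first mention

    def full(self) -> str:
        """Lead author, year, journal, DOI, in the register's pattern."""
        return f"{self.lead_author} et al. {self.year} · {self.journal} · doi:{self.doi}"


class Bibliography:
    """Emit the full citation on first use and the abbreviation thereafter."""

    def __init__(self) -> None:
        self._seen = set()

    def cite(self, citation: Citation) -> str:
        if citation.doi in self._seen:
            return citation.abbreviation
        self._seen.add(citation.doi)
        return citation.full()


aia = Citation("Lemen", 2012, "SoPh 275", "10.1007/s11207-011-9776-8", "L12")
bib = Bibliography()
print(bib.cite(aia))  # prints "Lemen et al. 2012 · SoPh 275 · doi:10.1007/s11207-011-9776-8"
print(bib.cite(aia))  # prints "L12"
```

Tracking first use by DOI rather than by author and year avoids collisions when two papers share a lead author and a publication year.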

For commercial reuse (operations integrators who want to redistribute Solstice corpora inside a paid product), the cosmic-ray segmentation corpus is CC-BY-SA-4.0, which means any adapted version of the corpus that you distribute must carry the same license. The EMCCD set is CC-BY-4.0, which is permissive: attribution is the main obligation. If either license is incompatible with a buyer's pipeline, talk to us before the integration begins, not after.

Want the validation notebook before the model?

That is, in fact, the correct order. The notebook tells you whether the headline metric is something your instrument cares about. If it is, the container is a separate, smaller conversation.

Request a notebook walkthrough →

The data our models live and die against.