Helios training + validation
A vision model is no better than the corpus it was trained on and no more honest than the validation set it has not yet seen. This page lists the archives Solstice Pro AI curates, ingests or replays, with sizes, licenses and citation patterns. None of it is proprietary; the discipline of using it well is the only thing that is.
An observatory does not own its sky. The light has already been arriving for several billion years, and any organisation that pretends to monopolise its analysis has misread the room. Solstice Pro AI works with five principal archives, three of them external and public, two of them curated by us, and a small library of internal calibration sequences taken on test-bench sensors. We do not gate any of them behind a login wall; where we cannot redistribute (NASA, GOES, ZTF) we point at the canonical mirror and ship the loaders.
What we do claim is the validation harness. Each model we release is paired with a complete instrument-night replay: a chronologically ordered sequence of frames, masks and human-adjudicated labels that the model has never been allowed to touch during training. The replay is the artefact a buyer should care about. The trained weights are the by-product.
Below is the working register, kept in the same observatory-log style we use for nightly operations. Sizes are uncompressed working-set sizes; on-disk footprint after lz4 is roughly 0.55× for image cubes and 0.72× for FITS tables.
| Archive | Size | License | Last updated | Citation pattern |
|---|---|---|---|---|
| AIA EUV solar archive · SDO/AIA · 94 / 131 / 171 / 193 / 211 / 304 / 335 Å · 2010-05 → present | 18.4 TB working set; full archive ≈ 1.4 PB at JSOC | NASA public domain · use without restriction; attribution requested | 2026-05-09 (cadence: 12 s native, we cache 60 s) | Lemen et al. 2012 · SoPh 275 · doi:10.1007/s11207-011-9776-8 |
| HMI magnetograms · SDO/HMI line-of-sight + vector · 2010-05 → present | 9.1 TB working set; full archive ≈ 0.6 PB | NASA public domain · JSOC distribution | 2026-05-09 (cadence: 720 s native, we keep all of it) | Scherrer et al. 2012 · SoPh 275 · doi:10.1007/s11207-011-9834-2 |
| ZTF public alert stream replay · 2018-06 → 2024-12 · real-bogus, light curves, host stamps | 11.7 TB packed AVRO; ≈ 8.3 billion alerts post-filter | ZTF public stream · DR-20+ terms; commercial use permitted with attribution | 2026-04-22 (frozen replay set v3.1) | Bellm et al. 2019 · PASP 131:018002 · doi:10.1088/1538-3873/aaecbe |
| EMCCD twilight calibration set · in-house bench · Andor iXon 888, gains g = 1, 50, 100, 200 | 1.8 TB · 12 nights · 41 200 frames | CC-BY-4.0 · published 2025-12-04 with the Photon-001 release | 2026-02-18 (added 4 nights of clouded-out diffuse-only frames) | Solstice Pro AI 2025 · dispatch-post-photon-limited-vision |
| Cosmic-ray segmentation corpus · 8 400 frames · hand-labelled · narrow-field 60 s exposures | 0.6 TB · pixel-accurate masks · 14 instruments represented | CC-BY-SA-4.0 · published 2026-03-08; mirrored on Zenodo | 2026-03-08 (v1.0 frozen; v1.1 in adjudication, expected Q3 2026) | Solstice Pro AI 2026 · Zenodo doi:10.5281/zenodo.7XXXXXX |
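As a rough illustration of the lz4 ratios quoted above the register, the arithmetic can be sketched as follows. The function name and the choice to treat both SDO working sets as image cubes are assumptions made for the example, not statements about how the archives are actually stored.

```python
# Compression ratios quoted in the text (on-disk size / uncompressed working set).
LZ4_RATIO = {"image_cube": 0.55, "fits_table": 0.72}


def on_disk_tb(working_set_tb: float, kind: str) -> float:
    """Estimate the lz4 on-disk footprint in TB for a given working-set size."""
    return working_set_tb * LZ4_RATIO[kind]


# Assumption for illustration only: both SDO working sets are stored as image cubes.
aia_tb = on_disk_tb(18.4, "image_cube")
hmi_tb = on_disk_tb(9.1, "image_cube")
print(f"AIA ~ {aia_tb:.1f} TB, HMI ~ {hmi_tb:.1f} TB on disk")
```

Under that assumption the two SDO working sets would need roughly 10 TB and 5 TB on disk respectively.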
Two of the five are ours to give away, and we do. The cosmic-ray corpus in particular took a junior astronomer and a senior one the better part of four months to label, double-adjudicate and triple-spot-check. We did not consider keeping it private; the community gains more from a published benchmark than we lose from a moat.
An astronomical paper that cannot be reproduced is gossip. An astronomical model that cannot be reproduced is worse, because it gives the appearance of operational reliability without the substance. Solstice Pro AI commits, at the level of company policy, to the following reproducibility constraints on every model release.
Every Solstice container is pinned to a specific CUDA toolkit (currently 12.4), a specific cuDNN minor version, a specific PyTorch wheel and a specific set of torch.use_deterministic_algorithms(True) guard rails. The non-deterministic kernels we cannot avoid — scatter-add, some atomic reductions — are flagged in the release notes with the expected pixel-level variance across runs. A buyer who reruns inference on the same input on the same GPU class will see byte-identical output for everything else.
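One way a buyer might check the byte-identical claim is to hash the raw output of two runs and compare digests. The sketch below is a minimal, stdlib-only illustration: `fake_inference` is a hypothetical stand-in for a model, not the Solstice container, and in a real PyTorch session the `torch.use_deterministic_algorithms(True)` call named above would be made before inference.

```python
import hashlib
import random
import struct


def output_digest(values):
    """SHA-256 over the raw little-endian float64 bytes of an output vector."""
    packed = struct.pack(f"<{len(values)}d", *values)
    return hashlib.sha256(packed).hexdigest()


def fake_inference(frame, seed=0):
    """Hypothetical stand-in for a model: a seeded, fully deterministic transform."""
    rng = random.Random(seed)
    weights = [rng.uniform(-1.0, 1.0) for _ in frame]
    return [w * x for w, x in zip(weights, frame)]


frame = [0.1 * i for i in range(64)]
first = output_digest(fake_inference(frame))
second = output_digest(fake_inference(frame))
print("byte-identical:", first == second)
```

Comparing digests rather than using a tolerance is deliberate: a tolerance check would hide exactly the run-to-run variance that the release notes are supposed to enumerate.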
The notebook is not a marketing artefact. It loads the held-out replay set, runs inference, computes the headline metric, and produces every plot in the release announcement. If we change a number on a dispatch post, we have changed the notebook, and the diff is in the git history. The notebook is the contract.
A random 80/20 split is the default sin of machine-learning practice and the one that an instrument night punishes hardest. Adjacent frames in a sky survey are correlated: same airmass, same seeing, same satellites passing overhead. A random split scatters those near-duplicate frames across both sides of the boundary, so a model evaluated on it is partly tested on conditions it has already seen and will overstate its real-night performance by a half-step to a full step. We therefore evaluate on whole instrument nights (a complete observing session, sunrise to sunrise) that the model has not been allowed to see in any form during training.
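A night-level split of the kind described above could look like the following sketch. The noon-to-noon definition of a night, the assumption that timestamps are in local time, and the names `night_id` and `split_by_night` are all illustrative choices, not the harness's actual convention.

```python
from datetime import datetime, timedelta


def night_id(timestamp: datetime) -> str:
    """Label a frame by observing night (assumed local time, noon to noon):
    frames taken before noon belong to the night that began the previous date."""
    return (timestamp - timedelta(hours=12)).date().isoformat()


def split_by_night(frames, held_out_nights):
    """Partition (timestamp, frame_id) pairs so that no night straddles the split."""
    train, held_out = [], []
    for ts, frame_id in frames:
        target = held_out if night_id(ts) in held_out_nights else train
        target.append(frame_id)
    return train, held_out


frames = [
    (datetime(2024, 3, 1, 22, 0), "f1"),
    (datetime(2024, 3, 2, 3, 30), "f2"),   # after midnight, same night as f1
    (datetime(2024, 3, 2, 21, 15), "f3"),
    (datetime(2024, 3, 3, 4, 45), "f4"),   # after midnight, same night as f3
]
train_ids, held_out_ids = split_by_night(frames, {"2024-03-02"})
print(train_ids, held_out_ids)
```

The split is decided per night, never per frame, so two frames taken minutes apart cannot end up on opposite sides of the boundary.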
Not all of these archives are training data. Some are validation only, some are calibration only, and the ZTF stream serves as both training and validation data, in a strict temporal partition. The table below makes the role explicit; readers from observatory data-management offices should find it the most useful part of this page.
| Archive | Role and partition |
|---|---|
| AIA EUV solar archive | Training partition: 2010-05 → 2022-12 (active-region samples balanced by GOES class). Validation partition: 2023-01 → 2024-06, frozen for the Helios-002 release. 2024-07 → present is held back for Helios-003. |
| HMI magnetograms | Identical temporal partition to AIA. We do not stack frames across the partition boundary; magnetogram evolution is the signal the model is supposed to learn. |
| ZTF public alert stream replay | Training: 2018-06 → 2023-12. Frozen replay: 2024-01 → 2024-12, used as the headline benchmark for Sky-001. We do not retrain against the replay; once it is held out, it stays out. |
| EMCCD twilight calibration set | Used to characterise sensor noise and to generate physically realistic synthetic frames for rare-event augmentation. Never used as validation; calibration is a different epistemic category to performance. |
| Cosmic-ray segmentation corpus | Held out as the published benchmark against which the LACosmic baseline is reported. v1.0 is frozen and is the only version whose numbers we quote in dispatches; v1.1 will require a new dispatch and a new headline. |
| Internal calibration sequences | ≈ 2 700 dark and bias sequences across 6 sensor classes, used to seed the self-supervised denoiser's noise model. Not released; they characterise specific test-bench hardware and would not generalise. |
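The Helios date boundaries given above (training through 2022-12, validation from 2023-01 through 2024-06, later data held back) can be expressed as a small lookup. This is a minimal sketch; the function name and the string labels are assumptions for illustration rather than identifiers from the Solstice loaders.

```python
from datetime import date

# Boundaries taken from the Helios partition described above.
TRAIN_START = date(2010, 5, 1)
VALIDATION_START = date(2023, 1, 1)
HELD_BACK_START = date(2024, 7, 1)


def helios_partition(obs_date: date) -> str:
    """Return the partition an observation date falls into."""
    if obs_date < TRAIN_START:
        raise ValueError("observation predates the archive")
    if obs_date < VALIDATION_START:
        return "train"
    if obs_date < HELD_BACK_START:
        return "validation"
    return "held_back"


print(helios_partition(date(2022, 12, 31)))
print(helios_partition(date(2023, 1, 1)))
```

Keeping the boundaries as named constants means a loader can refuse, in one place, to serve a validation or held-back date to a training job.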
If you use the Solstice cosmic-ray segmentation corpus, cite the Zenodo DOI in the table above and, if convenient, the dispatch post that announced it. If you use the EMCCD twilight calibration set, cite the Photon-001 release notes and acknowledge the bench (Andor iXon 888, sensor serial pool S-44/S-45/S-46). If you use a trained Solstice model under a research-collaborator agreement, cite the model SHA and the dispatch post that published the headline metric; the SHA is printed on the splash screen of every container.
In the other direction, we cite every archive we depend on in the standard astronomical-journal pattern: lead author, year, journal, DOI, on the first reference and an abbreviation thereafter. The bibliography for any given model release is included in the validation notebook and re-rendered every time the notebook is rerun, so it cannot drift from what the model actually consumes.
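The first-reference and abbreviation pattern described above can be sketched as a tiny bibliography helper. The class and field names are hypothetical, and the exact punctuation of the rendered string is an assumption; only the Lemen et al. entry is taken from the register.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    lead_author: str
    year: int
    journal: str
    doi: str


class Bibliography:
    """Render the full citation on first use and the short form afterwards."""

    def __init__(self):
        self._seen = set()

    def cite(self, ref: Reference) -> str:
        short = f"{ref.lead_author} et al. {ref.year}"
        if ref in self._seen:
            return short
        self._seen.add(ref)
        return f"{short}, {ref.journal}, doi:{ref.doi}"


aia = Reference("Lemen", 2012, "SoPh 275", "10.1007/s11207-011-9776-8")
bib = Bibliography()
first_mention = bib.cite(aia)
later_mention = bib.cite(aia)
print(first_mention)
print(later_mention)
```

Because the rendered strings are derived from the reference records each time the helper runs, regenerating them on every notebook rerun is what keeps the bibliography from drifting.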
For commercial reuse (operations integrators who want to redistribute Solstice corpora inside a paid product), the cosmic-ray segmentation corpus is CC-BY-SA-4.0, which means any derivative of the corpus must itself be released under the same share-alike terms. The EMCCD set is CC-BY-4.0, which is permissive and requires only attribution. If either license is incompatible with a buyer's pipeline, talk to us before the integration begins, not after.
Asking for the validation notebook before the container is, in fact, the correct order. The notebook tells you whether the headline metric is something your instrument cares about. If it is, the container is a separate, smaller conversation.