What "TSS 0.71" actually measures
The True Skill Statistic (TSS) is recall minus the false-positive rate: TP/(TP+FN) − FP/(FP+TN). It is the standard metric for binary flare forecasting because both terms are rates computed within a single class, which makes it insensitive to the (large) class imbalance between flaring and non-flaring active-region snapshots. A TSS of 0 is the no-skill baseline; 1 is perfect; published M-class-and-above 30-minute forecasts in the recent literature sit between 0.55 and 0.70 depending on the partition rules.
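A minimal sketch of the definition, to make the imbalance claim concrete. The confusion-matrix counts are hypothetical and chosen only for illustration; they are not Helios results.

```python
def true_skill_statistic(tp: int, fn: int, fp: int, tn: int) -> float:
    """TSS = recall - false-positive rate = TP/(TP+FN) - FP/(FP+TN)."""
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("TSS is undefined when either class is empty")
    recall = tp / (tp + fn)
    false_positive_rate = fp / (fp + tn)
    return recall - false_positive_rate


# Hypothetical counts: 100 flaring and 100 non-flaring snapshots.
balanced = true_skill_statistic(tp=80, fn=20, fp=10, tn=90)

# Same per-class rates, but 100x more non-flaring snapshots.
imbalanced = true_skill_statistic(tp=80, fn=20, fp=1000, tn=9000)

print(f"balanced TSS:   {balanced:.2f}")
print(f"imbalanced TSS: {imbalanced:.2f}")
```

Both calls give the same score because the ratio of non-flaring to flaring snapshots cancels out of each term.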
Our holdout (2025-01 through 2025-12, partitioned by active region) gives Helios TSS 0.71 at 30 minutes and 0.58 at 6 hours. On the harder leave-one-region-out cross-validation across the holdout, the 30-minute TSS drops to 0.66. That is the number we will defend in a contested room, and it is the number an operations centre should be planning around. The 0.71 headline assumes a deployment that has retrained on data including the period immediately preceding the operational window.
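For readers unfamiliar with the protocol, leave-one-region-out splitting can be sketched as below. This is an illustration of the general technique, not the Helios validation code; the function name and the region labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple


def leave_one_region_out(
    region_ids: Sequence[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_region, train_indices, test_indices), one fold per region.

    region_ids[i] is the active-region label of snapshot i. Every snapshot of
    the held-out region goes to the test side, so no region is ever split
    across train and test.
    """
    by_region: Dict[str, List[int]] = defaultdict(list)
    for index, region in enumerate(region_ids):
        by_region[region].append(index)

    for held_out, test_indices in by_region.items():
        train_indices = [
            i
            for region, indices in by_region.items()
            if region != held_out
            for i in indices
        ]
        yield held_out, train_indices, test_indices


# Hypothetical labels for five snapshots drawn from three regions.
snapshot_regions = ["AR_A", "AR_A", "AR_B", "AR_C", "AR_B"]
folds = list(leave_one_region_out(snapshot_regions))
for held_out, train, test in folds:
    print(held_out, "train:", train, "test:", test)
```

The per-fold predictions are then scored, which is typically harsher than a single split because the model never sees any snapshot of the region it is being tested on.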
Why magnetograms alone got us to 0.65 and stopped
The 2025-Q4 release used only HMI line-of-sight magnetograms. The TSS was 0.65 at 30 minutes. We knew, from a long literature on flare physics, that the magnetogram carries the photospheric driver but cannot easily encode the coronal reorganisation that immediately precedes an event. Adding EUV channels was the obvious next step; the question was how to add them without overfitting the network to channel-specific calibration artefacts that come and go on AIA over a decade of operations.
We tried three things in series. The first was a naive concatenation of all four channels at the input layer; it overfit the EUV branches to seasonal calibration variations in 2014–2015 and underperformed the magnetogram-only baseline on the holdout. The second was a frozen pretrained encoder per channel with a jointly trained head; it helped slightly but was hard to debug because the frozen encoders were inscrutable. The third, the current architecture, uses a small per-channel encoder trained jointly from scratch, with strong cross-channel data augmentation that includes per-channel intensity rescaling.
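Per-channel intensity rescaling can be sketched as follows. Everything specific here is an assumption made for illustration: the function name, the log-uniform gain distribution, the `max_log_scale` range, and the choice to rescale every channel passed in. The dispatch does not state the actual augmentation parameters.

```python
import math
import random
from typing import Dict, List

Image = List[List[float]]


def rescale_channels(
    sample: Dict[str, Image],
    rng: random.Random,
    max_log_scale: float = 0.2,  # hypothetical range, not a Helios setting
) -> Dict[str, Image]:
    """Multiply each channel by its own independently drawn gain.

    Drawing a separate gain per channel means the network cannot rely on the
    absolute brightness of any one channel, or on a fixed brightness ratio
    between channels, both of which shift when an instrument's calibration
    drifts.
    """
    augmented: Dict[str, Image] = {}
    for name, image in sample.items():
        # Log-uniform gain in [exp(-max_log_scale), exp(+max_log_scale)].
        gain = math.exp(rng.uniform(-max_log_scale, max_log_scale))
        augmented[name] = [[pixel * gain for pixel in row] for row in image]
    return augmented


# Hypothetical 2x2 cutouts for two EUV channels.
sample = {
    "aia_131": [[1.0, 2.0], [3.0, 4.0]],
    "aia_193": [[10.0, 20.0], [30.0, 40.0]],
}
augmented = rescale_channels(sample, random.Random(0))
```

Within a channel the spatial structure is unchanged (every pixel shares one gain); only the overall level moves.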
What each branch is actually worth
| Configuration | TSS @ 30 min | TSS @ 6 hr |
|---|---|---|
| HMI only (2025-Q4 baseline) | 0.65 | 0.54 |
| HMI + AIA 131 Å | 0.68 | 0.55 |
| HMI + AIA 193 Å | 0.67 | 0.56 |
| HMI + AIA 304 Å | 0.66 | 0.55 |
| HMI + AIA 131 + 193 (no 304) | 0.70 | 0.57 |
| HMI + AIA 131 + 193 + 304 | 0.71 | 0.58 |
Read against the HMI-only baseline, 131, 193, and 304 individually contribute +0.03, +0.02, and +0.01 at 30 minutes, and together they contribute +0.06, so at the two-decimal precision of this table the single-channel gains sum to the joint gain. The pattern is reproducible; we believe (and the dispatch does not pretend to have proven) that the three channels carry complementary information about the same pre-flare event, which is why each channel's gain survives when the others are added rather than being absorbed by them.
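One way to check the arithmetic against the table is to compute each configuration's gain over the HMI-only baseline and compare the sum of the single-channel gains with the joint gain. The dictionary below copies the 30-minute column; the key names and helper are ours.

```python
# TSS at 30 minutes, copied from the table above.
TSS_30_MIN = {
    "HMI": 0.65,
    "HMI+131": 0.68,
    "HMI+193": 0.67,
    "HMI+304": 0.66,
    "HMI+131+193": 0.70,
    "HMI+131+193+304": 0.71,
}


def gain(config: str, baseline: str = "HMI") -> float:
    """Improvement over the baseline, rounded to the table's precision."""
    return round(TSS_30_MIN[config] - TSS_30_MIN[baseline], 2)


single_channel_gains = {c: gain(c) for c in ("HMI+131", "HMI+193", "HMI+304")}
single_channel_sum = round(sum(single_channel_gains.values()), 2)
joint_gain = gain("HMI+131+193+304")

# Positive: channels reinforce each other; negative: they are partly redundant.
interaction = round(joint_gain - single_channel_sum, 2)

print("single-channel gains:", single_channel_gains)
print("sum of single-channel gains:", single_channel_sum)
print("joint gain:", joint_gain)
print("interaction term:", interaction)
```

Because the table is rounded to two decimals, an interaction term smaller than about ±0.01 cannot be resolved from these figures alone.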
The step that turned 0.71 into a number an operations desk could use
Thresholding the raw softmax probability gave TSS 0.71, but the calibration plots were ones that anyone who has worked at an operations centre would refuse to deploy: the network was systematically over-confident in the 0.6–0.9 probability range, which is exactly the range a forecaster uses to decide whether to issue an alert. We post-process the raw network probability through an isotonic regression fit on a held-out calibration year (2024). After calibration the reliability diagram tracks the diagonal to within ±0.04 across the full probability range; before calibration the departure from the diagonal was as much as 0.18 at probability 0.75.
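The idea behind isotonic calibration can be sketched with the pool-adjacent-violators algorithm in plain Python. This is a stand-in for illustration: the dispatch does not say which implementation Helios uses, the function names are ours, and the toy data is hypothetical. The sketch returns a step function, whereas a library implementation such as scikit-learn's `IsotonicRegression` interpolates linearly between fitted points.

```python
from bisect import bisect_left
from collections import defaultdict
from typing import List, Sequence, Tuple


def fit_isotonic(
    raw_probs: Sequence[float], outcomes: Sequence[int]
) -> Tuple[List[float], List[float]]:
    """Fit a non-decreasing map from raw probability to observed frequency.

    Returns (upper_edges, values): raw probabilities up to upper_edges[i]
    (and above upper_edges[i-1]) map to values[i].
    """
    # Pool tied raw probabilities first: prob -> [sum of outcomes, count].
    tied = defaultdict(lambda: [0.0, 0.0])
    for prob, outcome in zip(raw_probs, outcomes):
        tied[prob][0] += outcome
        tied[prob][1] += 1.0

    # Each block is [sum of outcomes, count, largest raw prob in block].
    blocks: List[List[float]] = []
    for prob in sorted(tied):
        total, count = tied[prob]
        blocks.append([total, count, prob])
        # Merge backwards while block means violate monotonicity.
        while (
            len(blocks) > 1
            and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper

    upper_edges = [block[2] for block in blocks]
    values = [block[0] / block[1] for block in blocks]
    return upper_edges, values


def calibrate(prob: float, upper_edges: List[float], values: List[float]) -> float:
    """Look up the calibrated value; inputs past the last edge are clipped."""
    index = min(bisect_left(upper_edges, prob), len(values) - 1)
    return values[index]


# Hypothetical calibration set: raw network outputs and whether a flare followed.
raw = [0.1, 0.2, 0.3, 0.4]
flared = [0, 1, 0, 1]
edges, calibrated_values = fit_isotonic(raw, flared)
print(edges, calibrated_values)
print(calibrate(0.25, edges, calibrated_values))
```

Because the fitted map is non-decreasing, it preserves the ranking of forecasts apart from ties created by pooling; it changes what the probability values mean, which is what the reliability diagram measures.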
Calibration was the single change that converted Helios from a research-grade benchmark winner into a piece of code an operations partner was willing to put into shadow traffic. We are convinced that any flare-forecasting paper that does not show a calibration plot is, in operational terms, incomplete.
What 0.71 does not tell you
Limb events (active regions within ~15° of the solar limb) are the dominant single source of confident-and-wrong Helios forecasts. The projection of the magnetogram becomes unreliable and the AIA branches see foreshortened structure; the network's representation drifts and the calibration assumption breaks. Our internal-deployment flag tags limb-proximate forecasts with a confidence multiplier, but the underlying problem is geometric and we do not believe it is solvable with more training data.
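A limb-proximity tag of this kind can be sketched from an active region's heliographic coordinates. The 15° margin comes from the text above; everything else is an assumption for illustration: the function names, the 0.5 multiplier value, the use of longitude measured from the central meridian, and the simplification that the solar B0 tilt angle is zero.

```python
import math

LIMB_MARGIN_DEG = 15.0  # from the dispatch: within ~15 degrees of the limb


def angle_from_disk_centre_deg(lat_deg: float, lon_deg: float) -> float:
    """Angular distance between a surface point and the sub-observer point.

    Assumes longitude is measured from the central meridian and B0 = 0, so
    the sub-observer point is at latitude 0, longitude 0.
    """
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    cos_angle = math.cos(lat) * math.cos(lon)
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def is_limb_proximate(
    lat_deg: float, lon_deg: float, margin_deg: float = LIMB_MARGIN_DEG
) -> bool:
    """True when the region is within margin_deg of the limb (90 degrees out)."""
    return angle_from_disk_centre_deg(lat_deg, lon_deg) >= 90.0 - margin_deg


def tag_forecast(
    probability: float,
    lat_deg: float,
    lon_deg: float,
    limb_multiplier: float = 0.5,  # hypothetical value, not the Helios setting
) -> dict:
    """Attach a limb flag and confidence multiplier without altering the forecast."""
    limb = is_limb_proximate(lat_deg, lon_deg)
    return {
        "probability": probability,
        "limb_proximate": limb,
        "confidence_multiplier": limb_multiplier if limb else 1.0,
    }


near_limb = tag_forecast(0.8, lat_deg=10.0, lon_deg=80.0)
disk_centre = tag_forecast(0.8, lat_deg=10.0, lon_deg=20.0)
print(near_limb)
print(disk_centre)
```

Keeping the raw probability and the multiplier as separate fields leaves the decision about how to combine them with the consumer of the forecast.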
Multi-flare days produce a different failure mode: when two strong active regions are co-evolving and a precursor signature in one region propagates structurally into a neighbouring region's cutout, Helios attributes the flare to the wrong active region. The flare did happen, the lead time was right, but the spatial attribution was wrong — and a forecaster who acted on the spatial label would have prepared the wrong follow-up. The validation notebook reports a separate "near miss" confusion matrix that surfaces this.
Calibration drifts after long quiet periods. We monitor the rolling-30-day flare base rate and trigger a re-calibration when it falls below an internal threshold; the trigger fired three times on the holdout year. Without re-calibration the post-quiet operating points would have drifted by up to 0.06 in TSS.
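A rolling base-rate monitor along these lines can be sketched as below. The 30-day window comes from the text; the threshold value, the daily-count input format, and the choice to fire only on the downward crossing (rather than on every day spent below the threshold) are assumptions, since the internal threshold is not published.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple


def recalibration_triggers(
    daily_counts: Iterable[Tuple[int, int]],
    window_days: int = 30,
    min_base_rate: float = 0.01,  # hypothetical threshold
) -> Iterator[int]:
    """Yield the day index each time the rolling flare base rate drops below threshold.

    daily_counts holds (flaring_snapshots, total_snapshots) per day.
    """
    window: Deque[Tuple[int, int]] = deque(maxlen=window_days)
    was_below = False
    for day, counts in enumerate(daily_counts):
        window.append(counts)
        if len(window) < window_days:
            continue  # wait for a full window before judging the rate
        total = sum(n for _, n in window)
        if total == 0:
            continue  # no snapshots in the window, rate undefined
        rate = sum(flaring for flaring, _ in window) / total
        is_below = rate < min_base_rate
        if is_below and not was_below:
            yield day
        was_below = is_below


# Hypothetical year fragment: active month, 40 quiet days, active month.
history = [(5, 100)] * 30 + [(0, 100)] * 40 + [(5, 100)] * 30
trigger_days = list(recalibration_triggers(history, min_base_rate=0.011))
print(trigger_days)
```

Firing on the crossing rather than on the level avoids re-fitting the calibration map on every day of a long quiet stretch.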
The headline metric is the part of the work that gets read in the abstract. The calibration plot is the part of the work that gets read in the operations runbook. They are not the same audience and they should not be the same paragraph. — Solstice engineering desk · Dispatch 2026-04-22
The 2-hour horizon is the next benchmark we will defend
The current operational interest from our partner is the 2-hour forecast horizon, which sits between the well-studied 30-minute regime and the harder 6-hour regime where the magnetogram-driven physics begins to dominate over the coronal pre-event signatures. Our internal 2-hour TSS on the holdout is 0.62; we believe a 0.65 target is achievable within the next release cycle by including a higher-cadence HMI vector-magnetogram stream that we are evaluating. We will dispatch when the number holds up.
If you want the validation notebook
The notebook reproduces every TSS figure and every calibration plot in this dispatch. It runs end-to-end in under thirty minutes on a single A100. Email us with your operations context and we will send it.