Solstice Pro AIObservatory · 16544720
RA 21h 02m 11s DEC −03° 18′ 44″ Epoch J2026.07 Plate VI.c Field Dispatch 2026-01-25
Plate VI.c · Filed 2026-01-25 · Sky mission

A real-bogus classifier you can actually put in front of an alert broker.

This dispatch is about the operational regime that survey alert classification lives in, not the academic regime. We will look at why a 99% recall at 1% bogus contamination — the canonical benchmark of half the literature — is not a number an alert broker can deploy on; what changes when you reframe the problem around a 0.5% contamination budget; how the cosmic-ray segmentation pre-step quietly does half of the work; and what we expect to break when the same architecture meets the LSST volume.

Author: Solstice engineering desk · M. Renholm · Filed: 2026-01-25 · Reference: Sky r2026.01 validation notebook
RA 21h 04m DEC −03°  ·  Section 01 — The contamination budget

Why "99% at 1%" is not deployable

A typical clear night on a ZTF-class survey produces on the order of one million alerts. The astrophysical transient yield is small: perhaps fifty to a few hundred objects that a human astronomer actually wants to follow up. A classifier operating at 1% bogus contamination on the full stream therefore admits ten thousand spurious candidates per night into the downstream system. A broker that produces ten thousand spurious candidates per night is a broker that the community will stop using by month two; the contamination budget that an operations-grade broker can absorb is well under one per hundred.
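The arithmetic behind that paragraph can be written out in a few lines. This is a minimal sketch that, like the text, treats contamination as a fraction of the full nightly stream; the function name `spurious_per_night` is ours, chosen for illustration.

```python
def spurious_per_night(alerts_per_night: int, contamination: float) -> float:
    """Bogus alerts admitted downstream when contamination is a fraction of the full stream."""
    return alerts_per_night * contamination


# Nightly volume taken from the text: roughly one million alerts on a ZTF-class survey.
ALERTS_PER_NIGHT = 1_000_000

for rate in (0.01, 0.005):
    count = spurious_per_night(ALERTS_PER_NIGHT, rate)
    print(f"{rate:.1%} contamination: {count:,.0f} spurious candidates per night")
```

Set against a yield of fifty to a few hundred real transients, both counts remain large, which is why the operating point is argued from the downstream budget rather than from a benchmark convention.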

This is the framing that should be applied to any "real-bogus classifier" paper: not "what is the area under your ROC curve" but "what is your recall at the contamination rate your downstream system can actually pay for". For Sky, that operating point is 0.5% — conservative for ZTF, plausibly tight for LSST. At 0.5% contamination on the 2024-Jan replay set, Sky's recall is 97.4%.
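One way to compute "recall at a contamination budget" from classifier scores is sketched below. It is not Sky's evaluation code: the function name is hypothetical, the scores are toy values, and we assume contamination means the fraction of bogus alerts that pass the score cut, which is close to the full-stream fraction when almost every alert is bogus.

```python
import math


def recall_at_contamination(scores, is_real, budget):
    """Return (threshold, recall) for the loosest cut that keeps bogus passes within budget.

    Assumption: an alert passes when score > threshold, and `budget` is the
    allowed fraction of bogus alerts that may pass.
    """
    bogus = sorted((s for s, flag in zip(scores, is_real) if not flag), reverse=True)
    real = [s for s, flag in zip(scores, is_real) if flag]
    if not bogus or not real:
        raise ValueError("need at least one real and one bogus example")
    # Number of bogus alerts we may admit; the epsilon guards against float rounding.
    allowed = math.floor(budget * len(bogus) + 1e-9)
    # Cutting at the (allowed + 1)-th highest bogus score admits at most `allowed` bogus alerts.
    threshold = bogus[allowed] if allowed < len(bogus) else -math.inf
    recall = sum(s > threshold for s in real) / len(real)
    return threshold, recall


# Toy data, invented for illustration only.
bogus_scores = [i / 1000 for i in range(200)]
real_scores = [0.15, 0.30, 0.50, 0.90]
scores = bogus_scores + real_scores
labels = [False] * len(bogus_scores) + [True] * len(real_scores)

threshold, recall = recall_at_contamination(scores, labels, budget=0.01)
print(f"threshold={threshold:.3f} recall={recall:.2f}")
```

Reporting the pair (budget, recall) rather than an area under the curve is the point of the framing above: the threshold is fixed by what the downstream system can pay for, and recall is whatever survives that cut.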

RA 21h 20m DEC −02°  ·  Section 02 — The cosmic-ray pre-step

The unsung half of the system

Cosmic rays are the dominant source of bogus alerts in any survey stream we have worked with. The default cosmic-ray rejection in most survey pipelines is LACosmic with default parameters; on the 2024-Jan replay set LACosmic at defaults gives a pixel-level F1 of 0.871 with a 2.1% false-positive rate (the false-positive rate matters because LACosmic occasionally over-flags PSF cores, which then propagates as a missing-real-source error downstream). Sky's cosmic-ray U-Net, trained on 8 400 hand-labelled survey frames, reaches F1 0.962 at 0.4% false-positive rate.
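For readers who want to reproduce pixel-level numbers of this kind on their own frames, the two metrics can be computed from a predicted mask and a hand-labelled mask as follows. This is a generic sketch using NumPy, not the code behind the figures quoted above; `pixel_metrics` is a name introduced here.

```python
import numpy as np


def pixel_metrics(pred_mask, true_mask):
    """Pixel-level F1 and false-positive rate for a cosmic-ray mask.

    Both inputs are arrays of the same shape; nonzero means "cosmic-ray pixel".
    """
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    if pred.shape != true.shape:
        raise ValueError("masks must have the same shape")
    tp = np.count_nonzero(pred & true)    # flagged and truly cosmic ray
    fp = np.count_nonzero(pred & ~true)   # flagged but clean (e.g. a PSF core)
    fn = np.count_nonzero(~pred & true)   # missed cosmic-ray pixel
    tn = np.count_nonzero(~pred & ~true)  # correctly left alone
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return f1, fpr


# Tiny invented example: two true cosmic-ray pixels, one of them found, one false flag.
true_mask = np.zeros((4, 4), dtype=bool)
true_mask[0, 0] = true_mask[0, 1] = True
pred_mask = np.zeros((4, 4), dtype=bool)
pred_mask[0, 1] = pred_mask[0, 2] = True

f1, fpr = pixel_metrics(pred_mask, true_mask)
print(f"F1={f1:.3f} FPR={fpr:.3f}")
```

The false-positive rate is tracked separately from F1 because, as noted above, over-flagged PSF cores turn into missing real sources downstream.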

The interesting observation is what this does to the real-bogus classifier sitting downstream. With LACosmic at defaults, the real-bogus classifier also has to learn to reject the cosmic rays that LACosmic missed, which forces a more conservative operating point and costs recall. With the Sky cosmic-ray U-Net in front, the real-bogus classifier sees a cleaner stream, and the operating point that gives 0.5% contamination moves to a higher recall, by roughly 2.8 points on the holdout. In other words: half of the headline "97.4% recall at 0.5% bogus" is the pre-step, and any paper that reports a real-bogus benchmark without specifying what cosmic-ray rejection was used upstream is reporting an incomplete number.

RA 21h 38m DEC −01°  ·  Section 03 — The architecture choice

End-to-end CNN versus hand-engineered features

The Solstice engineering desk spent two release cycles trying to make hand-engineered features work for real-bogus classification, primarily because hand-engineered features are easier to debug at three in the morning when an operations partner is on the phone about a strange alert. The features we tried — absorption ratio, PSF chi-squared, neighbouring-pixel correlation, several measures of the difference-image residual symmetry — got us to 94% recall at 0.5% contamination.

The end-to-end CNN reading 63×63 cutouts of (science, reference, difference, cosmic-ray-mask) reached 97.4% recall at the same operating point on the same data. We tried, twice, to push the hand-engineered features past the CNN by adding more features; the CNN had a structural advantage that we could not close. The deployment trade-off (less interpretable, harder to debug at three in the morning) was the right one because the recall advantage translated directly into recovered transients in the operational stream.
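The input described here, a four-plane 63×63 cutout, can be assembled roughly as follows. This is a sketch with NumPy under our own assumptions: the function name, the (row, col) centring convention, the zero padding at frame edges, and the float32 dtype are illustrative choices, not details given by the Sky pipeline.

```python
import numpy as np


def cutout_stack(science, reference, difference, cr_mask, row, col, size=63):
    """Stack four co-registered 2-D frames into a (4, size, size) array centred on (row, col).

    Assumptions: `size` is odd, all frames share one shape, and regions that
    fall off the frame edge are zero-padded.
    """
    if size % 2 == 0:
        raise ValueError("size must be odd so the alert sits on the centre pixel")
    half = size // 2
    planes = []
    for image in (science, reference, difference, cr_mask):
        image = np.asarray(image, dtype=np.float32)
        if not (0 <= row < image.shape[0] and 0 <= col < image.shape[1]):
            raise ValueError("alert position lies outside the frame")
        # Padding by `half` moves pixel (row, col) to (row + half, col + half),
        # so the window centred on it starts at (row, col) in padded coordinates.
        padded = np.pad(image, half, mode="constant")
        planes.append(padded[row:row + size, col:col + size])
    return np.stack(planes)


# Invented frames for illustration.
frame = np.arange(100 * 100, dtype=np.float32).reshape(100, 100)
mask = np.zeros((100, 100), dtype=np.float32)

stack = cutout_stack(frame, frame, frame, mask, row=0, col=5)
print(stack.shape)
```

Padding the whole frame for every alert is wasteful; a production version would pad once per frame or slice with explicit bounds. Passing the cosmic-ray mask as its own plane is what lets the classifier use the pre-step's output directly.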

RA 21h 56m DEC −00°  ·  Section 04 — Operating points

The shape of the curve

Operating point          | Recall on real transients | Use case
1.0% bogus contamination | 99.0%                     | Common academic benchmark · not deployable at survey scale
0.5% bogus contamination | 97.4%                     | Sky default · operational at ZTF-class throughput
0.3% bogus contamination | 94.1%                     | Partner broker A · tighter contamination, accepts the recall cost
0.1% bogus contamination | 85.6%                     | Spectroscopic follow-up triage · only the most confident alerts

The 0.1% operating point is included because one partner uses it as the trigger for an automatic spectroscopic follow-up request, where the cost of acting on a false alert is high enough that they would rather miss real transients than spend telescope time on a cosmic ray.
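Choosing among the operating points in the table above amounts to taking the highest-recall row that fits a consumer's contamination budget. A small sketch of that lookup follows; the (contamination, recall) pairs are copied from the table, while the function name and the data layout are ours.

```python
# (contamination, recall) pairs copied from the operating-point table.
OPERATING_POINTS = [
    (0.010, 0.990),
    (0.005, 0.974),
    (0.003, 0.941),
    (0.001, 0.856),
]


def pick_operating_point(max_contamination):
    """Return the highest-recall (contamination, recall) pair within the budget."""
    feasible = [p for p in OPERATING_POINTS if p[0] <= max_contamination]
    if not feasible:
        raise ValueError("no tabulated operating point fits this budget")
    return max(feasible, key=lambda p: p[1])


for budget in (0.005, 0.004, 0.001):
    contamination, recall = pick_operating_point(budget)
    print(f"budget {budget:.1%}: run at {contamination:.1%}, recall {recall:.1%}")
```

A budget that falls between two rows, such as 0.4%, resolves to the tighter row; interpolating along the curve would require the underlying score distribution, which the table does not provide.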

RA 22h 14m DEC −00°  ·  Section 05 — Latency

Inside the survey-night clock

The operational constraint on a survey-classifier system is the survey-night clock: difference images come out at survey cadence, and the classifier sits in a pipeline whose latency budget is measured in seconds, not minutes. Sky's combined (cosmic-ray U-Net + real-bogus classifier + isotonic calibration) latency is 14 ms per alert on an A100 and 22 ms on an L4. A peak alert rate of 2 000/s is well within budget on a single A100.
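The latency and throughput figures can be related with Little's law (items in flight = arrival rate × latency). The sketch below assumes the 14 ms and 22 ms figures are end-to-end per-alert latencies; the text does not state a batch size or concurrency level, so the derived numbers are implications of that assumption, not measured values.

```python
def serial_throughput(latency_s: float) -> float:
    """Alerts per second if alerts were processed strictly one at a time."""
    return 1.0 / latency_s


def alerts_in_flight(rate_per_s: float, latency_s: float) -> float:
    """Little's law: average number of alerts being processed concurrently."""
    return rate_per_s * latency_s


PEAK_RATE = 2000  # alerts per second, from the text

for name, latency in (("A100", 0.014), ("L4", 0.022)):
    print(
        f"{name}: {serial_throughput(latency):.0f} alerts/s if serial, "
        f"{alerts_in_flight(PEAK_RATE, latency):.0f} alerts in flight at {PEAK_RATE}/s"
    )
```

Under that reading, a strictly serial loop on an A100 would handle about 71 alerts per second, so sustaining the quoted peak implies processing a few dozen alerts at once, for example through batched inference.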

This sounds like a trivial observation but it is the reason we chose the architecture we chose. A bigger network might have moved the headline metric by 0.5 points but would have failed the latency budget on the partner broker's existing hardware. Operational deployment is about the binding constraint, not the loosest one.

RA 22h 35m DEC +00°  ·  Section 06 — LSST-era considerations

What we expect to break

Sky's headline numbers are on ZTF data: an alert depth of m ≤ 21, single-facility reference images, a single survey cadence. LSST will produce alerts at roughly an order of magnitude greater depth, with different difference-image properties (deeper templates, different PSF homogenisation, much higher alert volume per night). We expect the current Sky model to drop in performance on LSST data when it becomes available; we do not yet know by how much.

The architectural changes we anticipate: the cosmic-ray U-Net needs retraining on LSST-cadence data (cosmic-ray properties are detector-specific and we cannot transfer them blindly); the real-bogus classifier may need a deeper backbone to handle the increased complexity of difference images at LSST depth; the calibration assumption needs to be re-validated on the new alert volume. We have an internal plan for each of these and we will dispatch when there is a benchmark on real LSST data, not before.

Half of what makes a real-bogus classifier deployable is the cosmic-ray rejection that ran before it ever saw the cutout. Papers that omit the pre-step from the methods section are reporting a number that cannot be reproduced and should not be deployed on. — Solstice engineering desk · Dispatch 2026-01-25

If you run a survey alert broker

The honest first conversation is about your contamination budget and your latency budget. Email with the survey, the broker software, and the volume per night; we will reply with whether Sky fits or whether a different door is right.
