The gauntlet
The gauntlet is the heart of ClearEdge: a deliberately strict, anti-overfit pipeline that decides whether a strategy has a real edge or just a lucky fit. A positive backtest is the start of the conversation, never the end. Each stage protects against a specific way to fool yourself.
You can run these stages individually from the Backtests tab (each is a run kind), as one
command from the CLI (python -m ats.research), or unattended across the whole basket
(python -m ats.sweep). Their combined verdict drives the readiness score on
the research ladder.
The stages
1. Single backtest — the baseline
One historical simulation over the chosen window. It reads only the local catalog and fills on the next bar (no look-ahead). It produces the full metrics, the equity curve overlaid with buy-and-hold, a drawdown subplot, the round-trips, and a Monte Carlo panel. Protects against: nothing yet; it’s the hypothesis, not the proof.
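The next-bar fill rule can be illustrated with a minimal sketch. It assumes a signal decided on bar i can trade no earlier than the open of bar i + 1; the function name and the list-based data layout are hypothetical and are not taken from ClearEdge's engine.

```python
def next_bar_fills(signals, opens):
    """Pair each signal with the open of the following bar.

    signals[i] is decided using data up to and including bar i, so the
    earliest price it can trade at is opens[i + 1]. A signal on the last
    bar has no next bar and is dropped.
    """
    fills = []
    for i, side in enumerate(signals[:-1]):
        if side != 0:  # 0 means "no order"
            fills.append((i + 1, side, opens[i + 1]))
    return fills


# Hypothetical data: +1 = buy, -1 = sell, 0 = do nothing.
signals = [0, 1, 0, -1, 1]
opens = [100.0, 101.0, 102.5, 101.5, 103.0]
fills = next_bar_fills(signals, opens)
```

Each fill is a (bar index, side, price) tuple. The buy decided on bar 1 fills at bar 2's open, never at a price from bar 1 itself.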
2. Rule test (vs random) — does the entry matter?
The strategy runs for real, then N times with its entry rule replaced by random entries at the same frequency (same exits, sizing, fees, data). If the real Sharpe doesn’t clear the 95th percentile of that random-entry cloud, the entry timing is adding little and the exits and holding period are doing the work. Protects against: mistaking a generic holding-period effect for a real signal.
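A toy version of the rule test, as a sketch under stated assumptions: every trade holds for a fixed number of bars, the Sharpe is computed per trade without annualization, and all function names are hypothetical. The real run kind keeps the strategy's own exits, sizing, and fees.

```python
import random
import statistics


def sharpe(returns):
    """Per-trade Sharpe ratio (no annualization); 0.0 if undefined."""
    if len(returns) < 2:
        return 0.0
    sd = statistics.stdev(returns)
    return statistics.mean(returns) / sd if sd > 0 else 0.0


def trade_returns(bar_returns, entries, hold=3):
    """One return per entry: the sum of the next `hold` bar returns."""
    return [sum(bar_returns[i + 1:i + 1 + hold]) for i in entries]


def rule_test(bar_returns, real_entries, n_random=500, hold=3, seed=0):
    """Compare the real Sharpe with the 95th percentile of random-entry runs."""
    rng = random.Random(seed)
    real = sharpe(trade_returns(bar_returns, real_entries, hold))
    candidates = range(len(bar_returns) - hold)
    cloud = []
    for _ in range(n_random):
        # Same number of entries as the real rule, so the frequency matches.
        entries = rng.sample(candidates, len(real_entries))
        cloud.append(sharpe(trade_returns(bar_returns, entries, hold)))
    cloud.sort()
    p95 = cloud[int(0.95 * (len(cloud) - 1))]
    return real, p95, real > p95
```

Only the entry bars differ between the real run and the random runs, so a pass can be attributed to entry timing rather than to the holding period.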
3. Optimize (Optuna) — with a train/test split
A seeded TPE search over the strategy’s search space, one backtest per trial. Set a train/test split date and the optimizer only ever sees the training window; the top-K trials each get a Monte Carlo verdict and held-out test metrics, and the pipeline recommends the best candidate that isn’t lucky. Protects against: picking parameters by peeking at the test set.
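The split discipline can be sketched without Optuna. In the example below a seeded random search stands in for the TPE sampler, and the momentum rule, function names, and parameter range are all hypothetical; the point is only that the objective reads the training window and the held-out window is scored after the search ends.

```python
import random
import statistics


def sharpe(returns):
    """Per-bar Sharpe ratio (no annualization); 0.0 if undefined."""
    if len(returns) < 2:
        return 0.0
    sd = statistics.stdev(returns)
    return statistics.mean(returns) / sd if sd > 0 else 0.0


def backtest(bar_returns, lookback):
    """Toy momentum rule: hold the next bar when the trailing sum is positive."""
    out = []
    for i in range(lookback, len(bar_returns)):
        if sum(bar_returns[i - lookback:i]) > 0:
            out.append(bar_returns[i])
    return out


def optimize_with_holdout(bar_returns, split, n_trials=20, top_k=3, seed=0):
    """Search on bars before `split`; score only the finalists on the rest."""
    rng = random.Random(seed)
    train, test = bar_returns[:split], bar_returns[split:]
    trials = []
    for _ in range(n_trials):
        lookback = rng.randint(2, 30)
        # The search objective is computed on the training window only.
        trials.append((sharpe(backtest(train, lookback)), lookback))
    trials.sort(reverse=True)
    # Held-out metrics are computed once, after the search has finished.
    return [
        {"lookback": lb, "train_sharpe": s, "test_sharpe": sharpe(backtest(test, lb))}
        for s, lb in trials[:top_k]
    ]
```

A finalist with a high train_sharpe and a weak test_sharpe is the overfit signature this stage is meant to expose.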
4. Walk-forward & WFO — does it survive out-of-sample?
Walk-forward runs the same parameters on N out-of-sample windows. WFO (walk-forward optimization) is stronger: parameters are re-optimized on each window using only the past, traded on the next window, and the out-of-sample steps are stitched into one honest equity curve. WFO’s detail view includes a parameter-stability table: wild swings from window to window mean the optimizer is fitting noise. Protects against: a single in-sample fit that won’t generalize.
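The WFO loop can be sketched as follows, assuming a rolling fixed-length training window that advances by one test window per step (an anchored, growing window is an equally plausible choice). The `fit` and `trade` callables are hypothetical placeholders for the optimizer and the backtest.

```python
def walk_forward_windows(n_bars, train_size, test_size):
    """Yield (train_range, test_range) pairs; each test window follows its train window."""
    start = 0
    while start + train_size + test_size <= n_bars:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size  # roll forward by one out-of-sample step


def stitch_oos(bar_returns, train_size, test_size, fit, trade):
    """Re-fit on each past window, trade the next one, and concatenate the results."""
    stitched, params_log = [], []
    for train, test in walk_forward_windows(len(bar_returns), train_size, test_size):
        params = fit([bar_returns[i] for i in train])  # sees only the past
        stitched.extend(trade([bar_returns[i] for i in test], params))
        params_log.append(params)  # raw material for a stability table
    return stitched, params_log
```

The stitched series contains out-of-sample bars only, and `params_log` holds one entry per window, which is what a parameter-stability table would compare.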
5. Monte Carlo — how lucky was the ordering?
Resamples the trade sequence to estimate the distribution of outcomes. A strategy whose result depends on one fortunate ordering of trades is fragile. Protects against: path luck.
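One simple way to resample a trade sequence is to shuffle its order and recompute the maximum drawdown each time. The sketch below assumes that approach (bootstrap sampling with replacement is another common choice); the function names and the trade list are hypothetical.

```python
import random


def max_drawdown(trade_pnls):
    """Largest peak-to-trough drop of the cumulative P&L curve."""
    equity = peak = worst = 0.0
    for pnl in trade_pnls:
        equity += pnl
        peak = max(peak, equity)
        worst = max(worst, peak - equity)
    return worst


def shuffled_drawdowns(trade_pnls, n_paths=1000, seed=0):
    """Sorted max drawdowns over random reorderings of the same trades."""
    rng = random.Random(seed)
    pnls = list(trade_pnls)
    results = []
    for _ in range(n_paths):
        rng.shuffle(pnls)
        results.append(max_drawdown(pnls))
    results.sort()
    return results


# Hypothetical per-trade P&L in account currency.
trades = [5.0, -2.0, 3.0, -4.0, 6.0, -1.0, 2.0, -3.0]
observed = max_drawdown(trades)
paths = shuffled_drawdowns(trades)
p95 = paths[int(0.95 * (len(paths) - 1))]
```

Shuffling leaves total P&L unchanged, so the spread between `observed` and `p95` isolates how much of the drawdown profile came from the order in which the trades happened to arrive.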
6. Deflated Sharpe (DSR) — corrected for the search
The Sharpe ratio discounted for how many variants were tried (ats/workers/deflated_sharpe.py).
Searching hard for a high Sharpe is mass multiple-testing; without this correction you crown noise. A
DSR above ~0.95 is the hard gate: read as a probability, that is roughly 95% confidence that the edge
is real rather than the luckiest trial.
Protects against: selection bias from trying many strategies/parameters.
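For orientation, here is a minimal sketch of the deflated Sharpe ratio in the form commonly attributed to Bailey and López de Prado. It is not a copy of ats/workers/deflated_sharpe.py, which may differ; the function names and arguments are assumptions. Inputs are per-period (not annualized), and `kurt` is raw kurtosis, so 3.0 corresponds to normal returns.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329


def expected_max_sharpe(n_trials, trial_sharpe_var):
    """Expected best Sharpe among n_trials skill-less trials (needs n_trials >= 2)."""
    z = NormalDist()
    a = z.inv_cdf(1.0 - 1.0 / n_trials)
    b = z.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return math.sqrt(trial_sharpe_var) * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)


def deflated_sharpe(sr, n_obs, n_trials, trial_sharpe_var, skew=0.0, kurt=3.0):
    """Probability that the true Sharpe exceeds the luck-adjusted benchmark."""
    sr0 = expected_max_sharpe(n_trials, trial_sharpe_var)
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr * sr)
    return NormalDist().cdf((sr - sr0) * math.sqrt(n_obs - 1) / denom)
```

With the observed Sharpe and sample length held fixed, raising `n_trials` raises the benchmark and lowers the result, which is how a wide search gets penalized.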
7. Cost stress — does the edge survive friction?
Re-scores the strategy at, e.g., 2× the modeled trading costs. Thin edges evaporate under realistic or doubled costs. Protects against: an edge that exists only at zero/under-modeled cost.
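The arithmetic behind a cost stress is small enough to show directly. This sketch assumes a flat round-trip cost per trade expressed as a fraction of notional; the function names and numbers are hypothetical, and a real cost model would typically include spread and slippage terms.

```python
def net_returns(gross_trade_returns, cost_per_trade, cost_multiplier=1.0):
    """Subtract a flat round-trip cost, scaled by the stress multiplier, from each trade."""
    cost = cost_per_trade * cost_multiplier
    return [r - cost for r in gross_trade_returns]


def cost_stress(gross_trade_returns, cost_per_trade, multipliers=(1.0, 2.0)):
    """Mean net return per trade at each cost multiplier."""
    report = {}
    for m in multipliers:
        net = net_returns(gross_trade_returns, cost_per_trade, m)
        report[m] = sum(net) / len(net)
    return report


# Hypothetical thin edge: 0.15% average gross per trade, 0.10% modeled cost.
gross = [0.004, -0.002, 0.003, -0.001, 0.005, -0.002, 0.006, -0.001]
report = cost_stress(gross, cost_per_trade=0.001)
```

In this example the strategy nets about +0.05% per trade at modeled cost and about -0.05% at 2× cost, so the edge does not survive the stress.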
8. Buy-and-hold benchmark — is it worth the complexity?
Every run reports the instrument’s buy-and-hold return and the strategy’s excess over it. Beating buy-and-hold (net of cost and risk) is the bar that actually matters. Protects against: dressing up plain market exposure as alpha.
9. Cross-sectional breadth — does it show up on many names?
A real edge usually appears across many independent instruments, not two or three lucky ones. Breadth is a score bonus, never a gate, so a genuinely asset-specific edge (a structural pair, a sector anomaly) still qualifies, but a “winner” that works on only one name is treated with suspicion. See multi-instrument engines.
The hard gates
A strategy earns a ⭐ on the research ladder only when it clears all of:
- the rule test is significant (beats the 95th percentile of random entry), and
- held-out Sharpe > 0.5, and
- walk-forward out-of-sample Sharpe > 0.5, and
- deflated Sharpe > ~0.95.
These bars are deliberately strict: a positive rule test plus a positive held-out Sharpe is not enough. Monte Carlo and cross-sectional breadth are visible signals that inform the score but are not gates. Full scoring is on the research ladder.
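The four gates combine with a plain logical AND. A minimal sketch, using the thresholds listed above and hypothetical class and field names (the actual scoring code lives with the research ladder and may be organized differently):

```python
from dataclasses import dataclass


@dataclass
class GauntletResult:
    rule_test_significant: bool  # real Sharpe beat the 95th percentile of random entry
    heldout_sharpe: float
    walkforward_oos_sharpe: float
    deflated_sharpe: float


def failed_gates(r, min_sharpe=0.5, min_dsr=0.95):
    """Return the names of the hard gates that did not pass (empty list = star)."""
    checks = {
        "rule_test": r.rule_test_significant,
        "heldout_sharpe": r.heldout_sharpe > min_sharpe,
        "walkforward_oos_sharpe": r.walkforward_oos_sharpe > min_sharpe,
        "deflated_sharpe": r.deflated_sharpe > min_dsr,
    }
    return [name for name, ok in checks.items() if not ok]
```

Returning the list of failed gates, rather than a single boolean, makes it clear which stage blocked a candidate.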
The recurring lesson
Across the built-in research, the gauntlet keeps teaching the same thing: simple edges win, and added machinery rarely adds rule-test significance. The honest record of what survived and what was refuted is in the strategy library and the repo’s research notes.