The gauntlet

The gauntlet is the heart of ClearEdge: a deliberately strict, anti-overfit pipeline that decides whether a strategy has a real edge or just a lucky fit. A positive backtest is the start of the conversation, never the end. Each stage protects against a specific way to fool yourself.

You can run these stages individually from the Backtests tab (each is a run kind), as one command from the CLI (python -m ats.research), or unattended across the whole basket (python -m ats.sweep). Their combined verdict drives the readiness score on the research ladder.

The stages

1. Single backtest — the baseline

One historical simulation over the chosen window. It reads only the local catalog and fills on the next bar, so there is no look-ahead. It produces the full metrics, the equity curve overlaid with buy-and-hold, a drawdown subplot, the round-trips, and a Monte Carlo panel. Protects against: nothing yet. It’s the hypothesis, not the proof.
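The next-bar fill rule is easy to get wrong, so here is a minimal sketch of the idea. It assumes a signal decided on bar t is filled at the open of bar t+1; the function name, the signal encoding (+1 buy, -1 sell, 0 nothing), and the choice of the open price are illustrative assumptions, not ClearEdge's actual engine.

```python
def next_bar_fills(signals, opens):
    """Turn per-bar signals into fills on the following bar.

    signals[t] is decided using data up to and including bar t,
    so the earliest price it may trade at is opens[t + 1].
    Returns a list of (fill_bar_index, side, fill_price).
    """
    fills = []
    # The last bar's signal has no next bar to fill on, so it is dropped.
    for t in range(len(signals) - 1):
        if signals[t] != 0:
            fills.append((t + 1, signals[t], opens[t + 1]))
    return fills


fills = next_bar_fills([1, 0, -1, 1], [10.0, 11.0, 12.0, 13.0])
```

Filling at opens[t] instead would let the strategy trade at a price it could only have known after the bar it used to decide, which is exactly the look-ahead this stage rules out.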

2. Rule test (vs random) — does the entry matter?

The strategy runs for real, then N times with its entry rule replaced by random entries at the same frequency (same exits, sizing, fees, data). If the real Sharpe doesn’t clear the 95th percentile of that random-entry cloud, the entry timing is adding little — the exits/holding are doing the work. Protects against: mistaking a generic holding-period effect for a real signal.
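The comparison at the end of the rule test can be sketched as follows. It assumes you already have the real strategy's Sharpe and a list of Sharpes from the random-entry runs; `percentile` and `rule_test` are hypothetical helper names, and the linear-interpolation percentile is one common convention that may differ from what ClearEdge uses.

```python
def percentile(values, q):
    """Linear-interpolated percentile of values, with q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def rule_test(real_sharpe, random_sharpes, q=95.0):
    """Return (passed, threshold): does the real entry beat the random cloud?"""
    threshold = percentile(random_sharpes, q)
    return real_sharpe > threshold, threshold


# Toy cloud of random-entry Sharpes spread evenly from 0.00 to 1.00.
cloud = [i / 100 for i in range(101)]
passed, threshold = rule_test(1.2, cloud)
```

Because the random runs keep the same exits, sizing, fees, and data, anything the real run gains over the threshold can only come from entry timing.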

3. Optimize (Optuna) — with a train/test split

A seeded TPE search over the strategy’s search space, one backtest per trial. Set a train/test split date and the optimizer only ever sees the training window; the top-K trials each get a Monte Carlo verdict and held-out test metrics, and the pipeline recommends the best candidate that isn’t lucky. Protects against: picking parameters by peeking at the test set.
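The split discipline matters more than the sampler, so the sketch below swaps Optuna's TPE for a plain seeded random search to stay dependency-free. It also splits by index rather than by date. `optimize_with_holdout`, `backtest`, and `sample_params` are hypothetical names; the point is only that the objective sees the training window, and the held-out window is scored once, after ranking.

```python
import random
import statistics


def sharpe(returns):
    """Per-period Sharpe (mean / sample std); 0.0 when undefined."""
    if len(returns) < 2:
        return 0.0
    sd = statistics.stdev(returns)
    return statistics.mean(returns) / sd if sd > 0 else 0.0


def optimize_with_holdout(train, test, backtest, sample_params,
                          n_trials=50, top_k=3, seed=0):
    """Seeded random search on train only; test is touched after ranking."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        params = sample_params(rng)
        # The objective never sees the test window.
        trials.append((sharpe(backtest(train, params)), params))
    trials.sort(key=lambda t: t[0], reverse=True)
    return [
        {"params": p, "train_sharpe": s,
         "test_sharpe": sharpe(backtest(test, p))}
        for s, p in trials[:top_k]
    ]
```

A candidate whose train Sharpe is high but whose test Sharpe collapses is the "lucky" case the pipeline is meant to reject.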

4. Walk-forward & WFO — does it survive out-of-sample?

Walk-forward runs the same parameters on N out-of-sample windows. WFO (walk-forward optimization) is stronger: parameters are re-optimized on each window using only the past, traded on the next window, and the out-of-sample steps are stitched into one honest equity curve. WFO’s detail view includes a parameter-stability table — wild swings window-to-window mean the optimizer is fitting noise. Protects against: a single in-sample fit that won’t generalize.
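A minimal sketch of the WFO loop, assuming fixed-size rolling windows addressed by bar index. `walk_forward_windows` and `run_wfo` are hypothetical names, and `optimize` and `trade` stand in for whatever optimizer and backtester you plug in; ClearEdge's own windowing (anchored vs rolling, date-based sizes) may differ.

```python
def walk_forward_windows(n_bars, train_size, test_size):
    """Return (train_range, test_range) pairs; each test follows its train."""
    windows = []
    start = 0
    while start + train_size + test_size <= n_bars:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        windows.append((train, test))
        start += test_size  # step forward by one out-of-sample window
    return windows


def run_wfo(returns, train_size, test_size, optimize, trade):
    """Re-optimize on each past window, trade the next, stitch the results."""
    stitched, params_by_window = [], []
    for train, test in walk_forward_windows(len(returns), train_size, test_size):
        params = optimize([returns[i] for i in train])   # past data only
        params_by_window.append(params)                  # for the stability table
        stitched.extend(trade([returns[i] for i in test], params))
    return stitched, params_by_window
```

The `params_by_window` list is what a parameter-stability table would be built from: if consecutive entries jump around, the optimizer is likely chasing noise.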

5. Monte Carlo — how lucky was the ordering?

Resamples the trade sequence to estimate the distribution of outcomes. A strategy whose result depends on one fortunate ordering of trades is fragile. Protects against: path luck.
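One common way to resample a trade sequence is to shuffle the per-trade P&L and look at how the maximum drawdown varies across orderings. The sketch below assumes additive per-trade P&L and a shuffle without replacement; whether ClearEdge shuffles, bootstraps, or does something else is not stated here, and the function names are illustrative.

```python
import random


def max_drawdown(pnls):
    """Largest peak-to-trough drop of the cumulative P&L curve."""
    equity = peak = worst = 0.0
    for p in pnls:
        equity += p
        peak = max(peak, equity)
        worst = max(worst, peak - equity)
    return worst


def monte_carlo_drawdowns(pnls, n_sims=1000, seed=0):
    """Sorted max drawdowns over n_sims random orderings of the same trades."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_sims):
        shuffled = list(pnls)
        rng.shuffle(shuffled)
        results.append(max_drawdown(shuffled))
    return sorted(results)


drawdowns = monte_carlo_drawdowns([1.0, -2.0, 3.0, -1.0], n_sims=200)
```

If the drawdown the backtest actually realized sits near the favorable end of this distribution, the reported path was a lucky ordering rather than a typical one.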

6. Deflated Sharpe (DSR) — corrected for the search

The Sharpe ratio discounted for how many variants were tried (ats/workers/deflated_sharpe.py). Searching hard for a high Sharpe is multiple testing at scale; without this correction you crown noise. DSR is expressed as a probability, and a value above ~0.95 is the hard gate: confidence that the edge is real rather than just the luckiest trial. Protects against: selection bias from trying many strategies/parameters.
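For intuition, here is a sketch of the deflated Sharpe ratio as published by Bailey and López de Prado. It is not a copy of ats/workers/deflated_sharpe.py, which may differ in detail. It assumes a per-period (non-annualized) Sharpe, the number of return observations, the number of trials, and the variance of Sharpe across those trials; kurtosis is the non-excess value (3 for a normal distribution).

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329


def expected_max_sharpe(n_trials, sharpe_variance):
    """Sharpe you would expect from the best of n_trials skill-less trials."""
    if n_trials < 2:
        return 0.0  # a single trial needs no deflation
    z = NormalDist()
    return math.sqrt(sharpe_variance) * (
        (1 - EULER_GAMMA) * z.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * z.inv_cdf(1 - 1 / (n_trials * math.e))
    )


def deflated_sharpe(sr, n_obs, n_trials, sharpe_variance,
                    skew=0.0, kurtosis=3.0):
    """Probability that the true Sharpe exceeds the luck benchmark."""
    sr0 = expected_max_sharpe(n_trials, sharpe_variance)
    denom = math.sqrt(1 - skew * sr + (kurtosis - 1) / 4 * sr ** 2)
    return NormalDist().cdf((sr - sr0) * math.sqrt(n_obs - 1) / denom)
```

The same observed Sharpe scores lower as `n_trials` grows, because the benchmark it must beat is the best result you would expect from that many tries with no edge at all.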

7. Cost stress — does the edge survive friction?

Re-scores the strategy at, e.g., 2× the modeled trading costs. Thin edges evaporate under realistic or doubled costs. Protects against: an edge that exists only at zero/under-modeled cost.
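A toy illustration of a thin edge evaporating, assuming a flat per-trade cost subtracted from each trade's gross return. The cost model, the numbers, and the function name are made up for the example; a real run would re-score through the full backtest rather than adjust returns after the fact.

```python
import statistics


def net_trade_returns(gross_returns, cost_per_trade, cost_multiplier=1.0):
    """Subtract a (possibly stressed) flat cost from every trade."""
    return [r - cost_per_trade * cost_multiplier for r in gross_returns]


gross = [0.004, -0.002, 0.003]   # hypothetical gross per-trade returns
cost = 0.001                     # hypothetical modeled cost per trade

base = statistics.mean(net_trade_returns(gross, cost))                 # positive
stressed = statistics.mean(net_trade_returns(gross, cost, 2.0))        # negative
```

Here the average trade is profitable at the modeled cost and unprofitable at twice that cost, which is the pattern this stage is designed to surface.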

8. Buy-and-hold benchmark — is it worth the complexity?

Every run reports the instrument’s buy-and-hold return and the strategy’s excess over it. Beating buy-and-hold (net of cost and risk) is the bar that actually matters. Protects against: dressing up plain market exposure as alpha.
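The raw excess figure is simple arithmetic, sketched below under the assumption that buy-and-hold is measured from the first to the last price of the same window. The function names are illustrative, and this ignores the cost and risk adjustments the text calls for.

```python
def buy_and_hold_return(prices):
    """Total return from holding the instrument over the whole window."""
    return prices[-1] / prices[0] - 1.0


def excess_over_buy_and_hold(strategy_return, prices):
    """Strategy total return minus the instrument's buy-and-hold return."""
    return strategy_return - buy_and_hold_return(prices)


# Hypothetical window: the instrument gains about 21%, the strategy 15%.
excess = excess_over_buy_and_hold(0.15, [100.0, 110.0, 121.0])
```

A negative excess like this one means the strategy made money but would have made more by simply holding the instrument.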

9. Cross-sectional breadth — does it show up on many names?

A real edge appears across many independent instruments, not two or three lucky ones. Breadth is a score bonus, never a gate, so a genuinely asset-specific edge (a structural pair, a sector anomaly) still qualifies — but a “winner” that works on only one name is treated with suspicion. See multi-instrument engines.

The hard gates

A strategy earns a ⭐ on the research ladder only when it clears all of:

the rule test is significant (beats the 95th percentile of random entry), and
held-out Sharpe > 0.5, and
walk-forward out-of-sample Sharpe > 0.5, and
deflated Sharpe > ~0.95.
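The four gates above combine with a plain logical AND, which can be sketched as follows. `GauntletResult` and `earns_star` are hypothetical names, and the DSR bar is written as exactly 0.95 although the text gives it as approximate.

```python
from dataclasses import dataclass


@dataclass
class GauntletResult:
    real_sharpe: float          # strategy Sharpe in the rule test
    random_p95_sharpe: float    # 95th percentile of the random-entry cloud
    held_out_sharpe: float      # Sharpe on the held-out test window
    wfo_oos_sharpe: float       # walk-forward out-of-sample Sharpe
    deflated_sharpe: float      # DSR, a probability in [0, 1]


def earns_star(r, sharpe_bar=0.5, dsr_bar=0.95):
    """True only when every hard gate is cleared."""
    return (
        r.real_sharpe > r.random_p95_sharpe
        and r.held_out_sharpe > sharpe_bar
        and r.wfo_oos_sharpe > sharpe_bar
        and r.deflated_sharpe > dsr_bar
    )
```

Failing any single gate is enough to withhold the star, no matter how strong the other three look.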

These bars are deliberately strict: a positive rule test plus a positive held-out Sharpe is not enough. Monte Carlo and cross-sectional breadth are visible signals that inform the score but are not gates. Full scoring is on the research ladder.

The recurring lesson

Across the built-in research, the gauntlet keeps teaching the same thing: simple edges win, and added machinery rarely adds rule-test significance. The honest record — what survived and what was refuted — is in the strategy library and the repo’s research notes.