Phase 11a: Pennant criteria A/B test
Side-by-side comparison of the production bull pennant detector under current criteria (Baseline) vs proposed tightened criteria (Variant) over the full 2007–2026 historical dataset. Analytical only — no production code, config, or tables were modified.
Override mechanism (Approach a): the production pennant detector reads
its thresholds from uriel.config.get_config(), which returns a (mutable)
Pydantic v2 model. The harness at ab_test/run_ab.py calls
get_config() once per run and mutates the four duration fields before
invoking uriel.detect.pennant._detect_for_ticker(...) directly. Events
are returned in-memory and written to parquet under ab_test/; production
pattern_events is untouched. Outcomes (MFE/MAE/endpoints) are computed
inline using the same anchor-relative formula as
uriel.outcomes.profiler._profile_one_event (forward 30 trading days,
percent vs anchor close), but written to parquet rather than into
pattern_events. Note the v1.4/v1.5 Charter §7.5 Q2 minimum (5 bars) is
itself a parameter under test here — no implicit charter override
required.
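The override step can be sketched as follows. This is a minimal illustration, not the code in ab_test/run_ab.py: the SimpleNamespace object is a hypothetical stand-in for whatever uriel.config.get_config() returns, and `overridden` is an invented helper name. The sketch also restores the original values on exit, which the harness may or may not do; that is worth having if get_config() hands back a shared instance, because in-place mutation would otherwise leak into later callers in the same process.

```python
from contextlib import contextmanager
from types import SimpleNamespace

# Variant values taken from the Configuration table.
VARIANT_OVERRIDES = {
    "pennant.min_duration_bars": 10,
    "pennant.max_duration_bars": 20,
    "flagpole.min_duration_bars": 1,
    "flagpole.max_duration_bars": 5,
}


@contextmanager
def overridden(cfg, overrides):
    """Temporarily set dotted 'section.field' attributes on a mutable config object."""
    saved = {}
    try:
        for path, value in overrides.items():
            section_name, field = path.split(".")
            section = getattr(cfg, section_name)
            saved[path] = getattr(section, field)  # remember the original value
            setattr(section, field, value)
        yield cfg
    finally:
        # Restore originals even if the detector raised.
        for path, value in saved.items():
            section_name, field = path.split(".")
            setattr(getattr(cfg, section_name), field, value)


# Hypothetical stand-in for the object returned by get_config().
cfg = SimpleNamespace(
    pennant=SimpleNamespace(min_duration_bars=5, max_duration_bars=15),
    flagpole=SimpleNamespace(min_duration_bars=1, max_duration_bars=10),
)

with overridden(cfg, VARIANT_OVERRIDES) as variant_cfg:
    # The detector call would go here, reading thresholds from variant_cfg.
    inside = (variant_cfg.pennant.min_duration_bars,
              variant_cfg.flagpole.max_duration_bars)

after = (cfg.pennant.min_duration_bars, cfg.flagpole.max_duration_bars)
```

The same getattr/setattr calls work on a non-frozen Pydantic model, so the helper should carry over unchanged if the real config sections are laid out as assumed here.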
1. Configuration
| Parameter | Baseline | Variant |
|---|---|---|
| pennant.min_duration_bars | 5 | 10 |
| pennant.max_duration_bars | 15 | 20 |
| flagpole.min_duration_bars | 1 | 1 |
| flagpole.max_duration_bars | 10 | 5 |
| pennant.max_retrace_pct | 0.382 | 0.382 |
| flagpole.min_magnitude_pct | 12.0 | 12.0 |
| flagpole.min_atr_multiple | 4.0 | 4.0 |
| flagpole.volume_ratio_min | 1.5 | 1.5 |
| trend_filter (EMA_55 ≥ 10d prior) | on | on |
Date range scanned: 2007-02-15 → 2026-05-08 (≈19.2 years). Universe: 2,974 active tickers; 2,413 had ≥300 bars to qualify for scanning. Earliest variant anchor: 2007-02-16; both runs span the full window.
2. Detection counts
- Baseline: 15,534 events (matches the production pattern_events count exactly, which gives confidence that the harness reproduces production output)
- Variant: 5,155 events
- Variant / Baseline ratio: 0.332 — variant finds roughly one in three of the patterns baseline finds.
Per-year counts
| Year | Baseline | Variant | Variant/Baseline |
|---|---|---|---|
| 2007 | 400 | 127 | 0.32 |
| 2008 | 226 | 69 | 0.31 |
| 2009 | 702 | 238 | 0.34 |
| 2010 | 619 | 172 | 0.28 |
| 2011 | 471 | 158 | 0.34 |
| 2012 | 564 | 187 | 0.33 |
| 2013 | 875 | 233 | 0.27 |
| 2014 | 498 | 178 | 0.36 |
| 2015 | 533 | 186 | 0.35 |
| 2016 | 806 | 249 | 0.31 |
| 2017 | 871 | 301 | 0.35 |
| 2018 | 719 | 253 | 0.35 |
| 2019 | 748 | 260 | 0.35 |
| 2020 | 1,168 | 357 | 0.31 |
| 2021 | 1,336 | 455 | 0.34 |
| 2022 | 533 | 189 | 0.35 |
| 2023 | 1,024 | 349 | 0.34 |
| 2024 | 1,780 | 662 | 0.37 |
| 2025 | 1,232 | 397 | 0.32 |
| 2026 (YTD) | 429 | 135 | 0.31 |
The ratio is stable across market regimes, ranging from 0.27 (2013) to 0.37 (2024). The variant does not disproportionately favour or penalise any single year.
Per-sector counts
| Sector | Baseline | Variant | V/B |
|---|---|---|---|
| Healthcare | 3,206 | 1,091 | 0.34 |
| Technology | 2,694 | 943 | 0.35 |
| Industrials | 2,538 | 869 | 0.34 |
| Financial Services | 2,067 | 600 | 0.29 |
| Consumer Cyclical | 1,916 | 652 | 0.34 |
| Consumer Defensive | 744 | 235 | 0.32 |
| Energy | 677 | 216 | 0.32 |
| Basic Materials | 672 | 213 | 0.32 |
| Communication Services | 590 | 202 | 0.34 |
| Real Estate | 328 | 106 | 0.32 |
| Utilities | 102 | 28 | 0.27 |
Sector mix is preserved: V/B ranges from 0.27 (Utilities, on only 102 baseline events) to 0.35 (Technology), and no sector is disproportionately filtered.
3. Overlap analysis
| Measure | Count |
|---|---|
| Exact-anchor match (same symbol + same event_date) | 1,339 |
| Fuzzy match within ±1 calendar day (baseline events with ≥1 nearby variant) | 2,190 |
| Fuzzy match within ±1 calendar day (variant events with ≥1 nearby baseline) | 2,190 |
| Baseline-only events (no variant within ±1 day) | 13,344 |
| Variant-only events (no baseline within ±1 day) | 2,965 |
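The counts in this table can be reproduced with a set-based match along the following lines. This is a sketch under assumptions: events are represented as (symbol, event_date) tuples, and `overlap_counts` is a hypothetical helper rather than a function from ab_test/analyze.py.

```python
from datetime import date, timedelta


def overlap_counts(baseline, variant, tolerance_days=1):
    """Count exact and fuzzy anchor matches between two event populations.

    baseline, variant: iterables of (symbol, event_date) tuples.
    """
    base_set, var_set = set(baseline), set(variant)

    def has_neighbour(event, other):
        # True if `other` holds an event for the same symbol within the tolerance.
        symbol, day = event
        return any(
            (symbol, day + timedelta(days=offset)) in other
            for offset in range(-tolerance_days, tolerance_days + 1)
        )

    base_matched = sum(has_neighbour(e, var_set) for e in base_set)
    var_matched = sum(has_neighbour(e, base_set) for e in var_set)
    return {
        "exact": len(base_set & var_set),
        "baseline_fuzzy": base_matched,
        "variant_fuzzy": var_matched,
        "baseline_only": len(base_set) - base_matched,
        "variant_only": len(var_set) - var_matched,
    }


# Tiny made-up example, not real events.
baseline_events = [("AAA", date(2020, 1, 2)), ("AAA", date(2020, 1, 10)),
                   ("BBB", date(2020, 3, 5))]
variant_events = [("AAA", date(2020, 1, 2)), ("AAA", date(2020, 1, 11)),
                  ("CCC", date(2020, 3, 5))]
counts = overlap_counts(baseline_events, variant_events)
```

The two fuzzy counts are computed separately because they need not agree in general: one baseline event can sit within a day of two variant events, or the reverse.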
Interpretation. Only ≈8.6 % of baseline events have an exact variant
counterpart, and only ≈14 % match even with a ±1-day tolerance. Of the
5,155 variant events, 2,965 (≈57.5 %) have no baseline event within
±1 day, so under this matching rule they count as net-new detections.
This is despite the variant having a stricter pennant.min_duration_bars
and a stricter flagpole.max_duration_bars. Two mechanisms explain it:
- The variant's wider pennant window (max 20 vs 15) admits longer consolidations baseline rejects.
- The detector's inner-loop dedup (`skip_until_idx = end_idx + min_dur` and `break` on the first qualifying `win_len` at each `end_idx`) means that changing `min_duration_bars` changes which anchor "wins" at each symbol, so the two configurations can land on different anchor dates for the same underlying pattern.
The change is therefore not incremental — variant produces a qualitatively different event population, not a strict subset of baseline.
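Both mechanisms can be illustrated with a toy scan. Everything here is an assumption made for illustration: the loop shape is inferred from the description above, `scan` and `qualifies` are hypothetical names, and the rule that a window qualifies only when it starts at a fixed bar (bar 10, as if immediately after a flagpole) is a stand-in for the real pennant test in uriel.detect.pennant.

```python
def scan(n_bars, min_dur, max_dur, qualifies):
    """Toy version of the dedup loop: first qualifying window wins, then skip ahead."""
    anchors = []
    skip_until_idx = 0
    for end_idx in range(n_bars):
        if end_idx < skip_until_idx:
            continue  # still inside the dedup window of the previous event
        for win_len in range(min_dur, max_dur + 1):
            if qualifies(end_idx, win_len):
                anchors.append(end_idx)
                skip_until_idx = end_idx + min_dur
                break  # first qualifying win_len at this end_idx wins
    return anchors


def qualifies(end_idx, win_len):
    # Stand-in rule: the consolidation starts at bar 10 and lasts through bar 30.
    start_idx = end_idx - win_len + 1
    return start_idx == 10 and end_idx <= 30


baseline_anchors = scan(60, min_dur=5, max_dur=15, qualifies=qualifies)
variant_anchors = scan(60, min_dur=10, max_dur=20, qualifies=qualifies)
```

Under these toy assumptions the baseline emits anchors at bars 14, 19 and 24, while the variant emits 19 and 29: one shared anchor, two baseline-only anchors, and one variant-only anchor that exists only because a 20-bar window is allowed.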
4. MFE distribution comparison
| Stat (MFE %) | Baseline | Variant |
|---|---|---|
| Mean | 13.92 | 14.03 |
| Median | 7.50 | 7.47 |
| P25 | 2.67 | 2.63 |
| P75 | 16.01 | 15.97 |
| P90 | 30.87 | 30.54 |
Hit-rate at common MFE thresholds
| Threshold | Baseline | Variant |
|---|---|---|
| ≥ 5 % | 61.5 % | 62.0 % |
| ≥ 10 % | 40.5 % | 40.1 % |
| ≥ 15 % | 27.0 % | 27.0 % |
| ≥ 20 % | 18.9 % | 18.7 % |
| ≥ 30 % | 10.5 % | 10.5 % |
| ≥ 50 % | 4.3 % | 4.4 % |
The MFE distributions are nearly identical: every reported percentile differs by at most 0.33 percentage points, and every hit-rate by at most 0.5 percentage points.
5. MAE distribution comparison
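One rough way to check whether a hit-rate gap of this size is distinguishable from sampling noise is a two-proportion z-test. The sketch below applies it to the ≥ 5 % row, using the rounded hit-rates from the table and the "n (with outcomes)" counts from section 7. It treats the events as independent draws, which is an assumption that does not strictly hold here (the two populations share events, and events cluster by ticker and date), so the result is a guide rather than a formal finding.

```python
from math import erfc, sqrt


def two_proportion_z(p1, n1, p2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    std_err = sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p2 - p1) / std_err
    p_value = erfc(abs(z) / sqrt(2.0))  # two-sided tail of the standard normal
    return z, p_value


# Hit-rate at MFE >= 5 %: baseline 61.5 % of 15,528, variant 62.0 % of 5,154.
z, p_value = two_proportion_z(0.615, 15528, 0.620, 5154)
```

With these inputs |z| comes out well below the conventional 1.96 cut-off, which is consistent with reading the 0.5-point gap as noise.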
| Stat (MAE %) | Baseline | Variant |
|---|---|---|
| Mean | −9.54 | −9.68 |
| Median | −6.59 | −6.74 |
| P25 | −13.64 | −14.05 |
| P75 | −2.23 | −2.15 |
| P10 | −23.21 | −23.48 |
Stop-loss-relevant loss rates
| MAE worse than… | Baseline | Variant |
|---|---|---|
| −5 % | 58.5 % | 58.7 % |
| −7 % | 47.9 % | 48.7 % |
| −10 % | 35.9 % | 36.4 % |
| −15 % | 21.7 % | 22.4 % |
The variant's loss rates are marginally higher at every threshold (by 0.2 to 0.8 percentage points), and its mean, median, P25 and P10 MAE are all slightly deeper; only P75 is fractionally shallower (−2.15 vs −2.23). The effect is small but consistent: the variant does not improve the downside profile.
6. Time-to-MFE-peak comparison
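For reference, the anchor-relative outcome measures used in sections 4 to 6 can be sketched as below. The source specifies only "forward 30 trading days, percent vs anchor close"; taking MFE from forward highs and MAE from forward lows, counting days from 1, and treating "worse than" as less-than-or-equal are assumptions of this sketch, and the function names are hypothetical rather than taken from uriel.outcomes.profiler.

```python
def profile_event(anchor_close, fwd_highs, fwd_lows, horizon=30):
    """MFE, MAE (percent vs anchor close) and days to the MFE peak."""
    highs = list(fwd_highs)[:horizon]
    lows = list(fwd_lows)[:horizon]
    peak = max(highs)
    trough = min(lows)
    return {
        "mfe_pct": (peak - anchor_close) / anchor_close * 100.0,
        "mae_pct": (trough - anchor_close) / anchor_close * 100.0,
        "days_to_mfe": highs.index(peak) + 1,  # 1 = first bar after the anchor
    }


def hit_rate(mfe_values, threshold_pct):
    """Share of events whose MFE reached at least threshold_pct."""
    return sum(v >= threshold_pct for v in mfe_values) / len(mfe_values)


def loss_rate(mae_values, threshold_pct):
    """Share of events whose MAE was at or below threshold_pct (a negative number)."""
    return sum(v <= threshold_pct for v in mae_values) / len(mae_values)


# Made-up bars for a single event with anchor close 100.
outcome = profile_event(100.0, [101, 104, 103, 108, 102], [99, 97, 98, 101, 95])

# Made-up outcome lists for four events.
example_hit = hit_rate([8.0, 3.0, 16.0, 22.0], 15.0)
example_loss = loss_rate([-5.0, -12.0, -2.0, -8.0], -7.0)
```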
| Stat (days to MFE) | Baseline | Variant |
|---|---|---|
| Mean | 15.7 | 15.5 |
| Median | 16 | 16 |
| P25 | 5 | 5 |
| P75 | 26 | 26 |
| P90 | 30 | 30 |
Bucket distribution
| Days-to-peak | Baseline | Variant |
|---|---|---|
| 1 – 5 | 25.4 % | 25.7 % |
| 6 – 10 | 12.8 % | 13.3 % |
| 11 – 15 | 11.0 % | 10.7 % |
| 16 – 20 | 10.8 % | 11.4 % |
| 21 – 30 | 39.9 % | 38.8 % |
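The two populations resolve at indistinguishable speeds. The U-shaped profile (high mass at 1–5 and 21–30 days) is present in both: patterns that work tend to work fast or take the full window. Note that the 21–30 bucket spans ten days rather than five, which inflates its share relative to the other buckets; the P90 of 30 days nonetheless shows that at least a tenth of events peak on the final day of the window.
7. Quality vs quantity tradeoff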
Expectancy proxy = mean(MFE) × P(MFE ≥ 15 %).
| Metric | Baseline | Variant |
|---|---|---|
| n (with outcomes) | 15,528 | 5,154 |
| mean MFE % | 13.92 | 14.03 |
| P(MFE ≥ 15 %) | 27.0 % | 27.0 % |
| Per-pattern proxy | 3.76 | 3.79 |
| Population-total proxy (per-pattern × n) | 58,370 | 19,561 |
Interpretation. Per-pattern quality is essentially identical (+0.8 % on mean MFE, identical hit-rate at +15 %). But the variant detects only 33 % as many patterns, so the total expected-return contribution from the variant population is about one-third of the baseline's. The variant is fewer patterns at the same per-pattern quality, not "fewer but better." Endpoint returns at 5, 10, 20, 30 days tell the same story — means within 0.2 percentage points, medians within 0.2 percentage points, P75/P90 essentially overlapping.
The downside picture is marginally worse for the variant (0.2 – 0.8 pp higher loss rates at every threshold), suggesting the wider pennant window (up to 20 bars) admits some consolidations that decay rather than coil.
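The proxy arithmetic can be rechecked from the rounded table inputs. This is a worked example only; `expectancy_proxy` is a hypothetical helper name, and because the inputs are rounded the population totals land close to, but not exactly on, the table values, which were presumably computed from unrounded figures.

```python
def expectancy_proxy(mean_mfe_pct, hit_rate):
    """Per-pattern proxy as defined above: mean(MFE) x P(MFE >= 15 %)."""
    return mean_mfe_pct * hit_rate


# Rounded inputs from the table: mean MFE %, P(MFE >= 15 %), n with outcomes.
baseline_proxy = expectancy_proxy(13.92, 0.270)
variant_proxy = expectancy_proxy(14.03, 0.270)

baseline_total = baseline_proxy * 15528
variant_total = variant_proxy * 5154

# Share of the baseline's population-total proxy that the variant retains.
total_ratio = variant_total / baseline_total
```

The per-pattern figures round to 3.76 and 3.79, and the ratio of totals is roughly one-third, matching the interpretation above.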
8. Recommendation summary
The variant is preferable to the baseline only if the consumer of the detector values a smaller candidate set over total coverage and is willing to accept marginally deeper drawdowns in exchange. On the quality dimensions actually measured (forward MFE, MAE, time-to-peak, endpoint returns) the two populations are practically equivalent; the variant produces no measurable lift in any of them. The decision is therefore not a "better detector" question but a "right population size" question: fewer-but-equivalent patterns vs more-but-equivalent patterns. El Don decides.
Artifacts preserved under ab_test/:
run_ab.py, analyze.py, baseline_events.parquet,
variant_events.parquet, baseline_outcomes.parquet,
variant_outcomes.parquet, summary.json, run.log.