Phase 11a: Pennant criteria A/B test

Side-by-side comparison of the production bull pennant detector under current criteria (Baseline) vs proposed tightened criteria (Variant) over the full 2007–2026 historical dataset. Analytical only — no production code, config, or tables were modified.

Override mechanism (Approach a): the production pennant detector reads its thresholds from uriel.config.get_config(), which returns a (mutable) Pydantic v2 model. The harness at ab_test/run_ab.py calls get_config() once per run and mutates the four duration fields before invoking uriel.detect.pennant._detect_for_ticker(...) directly. Events are returned in-memory and written to parquet under ab_test/; production pattern_events is untouched. Outcomes (MFE/MAE/endpoints) are computed inline using the same anchor-relative formula as uriel.outcomes.profiler._profile_one_event (forward 30 trading days, percent vs anchor close), but written to parquet rather than into pattern_events. Note the v1.4/v1.5 Charter §7.5 Q2 minimum (5 bars) is itself a parameter under test here — no implicit charter override required.
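The override step can be sketched as follows. This is a minimal illustration, not the contents of ab_test/run_ab.py: the dataclasses below are hypothetical stand-ins for the Pydantic model returned by uriel.config.get_config(), and apply_overrides is an invented helper. Unlike the harness, which mutates the returned object in place, the sketch works on a deep copy so the baseline values stay available for comparison.

```python
import copy
from dataclasses import dataclass, field


# Hypothetical stand-ins for the real config model (assumption: the real
# model groups the duration fields under "pennant" and "flagpole").
@dataclass
class PennantCfg:
    min_duration_bars: int = 5
    max_duration_bars: int = 15


@dataclass
class FlagpoleCfg:
    min_duration_bars: int = 1
    max_duration_bars: int = 10


@dataclass
class DetectorCfg:
    pennant: PennantCfg = field(default_factory=PennantCfg)
    flagpole: FlagpoleCfg = field(default_factory=FlagpoleCfg)


# The four duration fields under test, with the Variant values from section 1.
VARIANT_OVERRIDES = {
    ("pennant", "min_duration_bars"): 10,
    ("pennant", "max_duration_bars"): 20,
    ("flagpole", "min_duration_bars"): 1,
    ("flagpole", "max_duration_bars"): 5,
}


def apply_overrides(cfg, overrides):
    """Return a deep copy of cfg with the overrides set; cfg itself is untouched."""
    out = copy.deepcopy(cfg)
    for (section, name), value in overrides.items():
        setattr(getattr(out, section), name, value)
    return out


baseline_cfg = DetectorCfg()
variant_cfg = apply_overrides(baseline_cfg, VARIANT_OVERRIDES)
print(baseline_cfg.pennant, variant_cfg.pennant)
```

Working on a copy is a design choice of the sketch: it prevents a Variant run from leaking its thresholds into a later Baseline run in the same process.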

1. Configuration

Parameter	Baseline	Variant
pennant.min_duration_bars	5	10
pennant.max_duration_bars	15	20
flagpole.min_duration_bars	1	1
flagpole.max_duration_bars	10	5
pennant.max_retrace_pct	0.382	0.382
flagpole.min_magnitude_pct	12.0	12.0
flagpole.min_atr_multiple	4.0	4.0
flagpole.volume_ratio_min	1.5	1.5
trend_filter (EMA_55 ≥ 10d prior)	on	on

Date range scanned: 2007-02-15 → 2026-05-08 (≈19.2 years). Universe: 2,974 active tickers, of which 2,413 had ≥300 bars and so qualified for scanning. Earliest variant anchor: 2007-02-16; both runs span the full window.

2. Detection counts

Baseline: 15,534 events. This matches the production pattern_events count exactly, which is good evidence that the harness reproduces production output (a count match alone is not an event-by-event check).
Variant: 5,155 events
Variant / Baseline ratio: 0.332 — variant finds roughly one in three of the patterns baseline finds.

Per-year counts

Year	Baseline	Variant	Variant/Baseline
2007	400	127	0.32
2008	226	69	0.31
2009	702	238	0.34
2010	619	172	0.28
2011	471	158	0.34
2012	564	187	0.33
2013	875	233	0.27
2014	498	178	0.36
2015	533	186	0.35
2016	806	249	0.31
2017	871	301	0.35
2018	719	253	0.35
2019	748	260	0.35
2020	1,168	357	0.31
2021	1,336	455	0.34
2022	533	189	0.35
2023	1,024	349	0.34
2024	1,780	662	0.37
2025	1,232	397	0.32
2026 (YTD)	429	135	0.31

The ratio is remarkably stable across regimes (0.27 – 0.37). No year shows the variant disproportionately favouring or punishing a particular regime.

Per-sector counts

Sector	Baseline	Variant	V/B
Healthcare	3,206	1,091	0.34
Technology	2,694	943	0.35
Industrials	2,538	869	0.34
Financial Services	2,067	600	0.29
Consumer Cyclical	1,916	652	0.34
Consumer Defensive	744	235	0.32
Energy	677	216	0.32
Basic Materials	672	213	0.32
Communication Services	590	202	0.34
Real Estate	328	106	0.32
Utilities	102	28	0.27

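Sector mix is broadly preserved: the Variant/Baseline ratio (V/B) stays between 0.27 and 0.35 in every sector. Financial Services (0.29) and Utilities (0.27, on only 28 variant events) are filtered slightly harder than the rest, but no sector departs far from the overall 0.33.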

3. Overlap analysis

Measure	Count
Exact-anchor match (same symbol + same `event_date`)	1,339
Fuzzy match within ±1 calendar day (baseline events with ≥1 nearby variant)	2,190
Fuzzy match within ±1 calendar day (variant events with ≥1 nearby baseline)	2,190
Baseline-only events (no variant within ±1 day)	13,344
Variant-only events (no baseline within ±1 day)	2,965
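The overlap measures in the table can be reproduced with a per-symbol date lookup. The sketch below is an assumption about how such a comparison might be written, not the code in ab_test/analyze.py; the helper names and the toy events are invented for illustration, and events are assumed to be (symbol, event_date) pairs.

```python
from bisect import bisect_left
from collections import defaultdict
from datetime import date, timedelta


def index_by_symbol(events):
    """Map each symbol to a sorted list of its event dates."""
    idx = defaultdict(list)
    for symbol, event_date in events:
        idx[symbol].append(event_date)
    for dates in idx.values():
        dates.sort()
    return idx


def count_with_neighbour(events, other_index, tolerance_days=1):
    """Count events with at least one same-symbol event in other_index
    within +/- tolerance_days calendar days."""
    tol = timedelta(days=tolerance_days)
    hits = 0
    for symbol, event_date in events:
        dates = other_index.get(symbol, [])
        # First date on or after the lower edge of the tolerance window.
        i = bisect_left(dates, event_date - tol)
        if i < len(dates) and dates[i] <= event_date + tol:
            hits += 1
    return hits


# Toy data, invented for illustration only.
baseline = [("AAA", date(2020, 1, 6)), ("AAA", date(2020, 1, 7)),
            ("AAA", date(2020, 3, 2)), ("BBB", date(2021, 5, 10))]
variant = [("AAA", date(2020, 1, 7)), ("BBB", date(2021, 6, 1))]

exact = len(set(baseline) & set(variant))
base_matched = count_with_neighbour(baseline, index_by_symbol(variant))
var_matched = count_with_neighbour(variant, index_by_symbol(baseline))
print(exact, base_matched, len(baseline) - base_matched,
      var_matched, len(variant) - var_matched)
```

In the toy data the two directional counts differ (two baseline events sit next to one variant event), which is why the table reports each direction separately even though both come to 2,190 in this run.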

Interpretation. Only ≈8.6 % of baseline events (1,339 of 15,534) have an exact variant counterpart, and only ≈14 % (2,190) match even with a ±1-day tolerance. Of the 5,155 variant events, 2,965 (57.5 %) have no baseline event on the same symbol within ±1 day; relative to the baseline population they are net-new detections. This holds even though the variant has a stricter pennant.min_duration_bars and a stricter flagpole.max_duration_bars. Two mechanisms explain it:

The variant's wider pennant window (max 20 vs 15) admits longer consolidations baseline rejects.
The detector's inner-loop dedup (skip_until_idx = end_idx + min_dur and break on the first qualifying win_len at each end_idx) means that changing min_duration_bars changes which anchor "wins" at each symbol, so the two configurations can land on different anchor dates for the same underlying pattern.

The change is therefore not incremental — variant produces a qualitatively different event population, not a strict subset of baseline.
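A toy model makes both mechanisms concrete. Everything in it is an assumption made for illustration: the real logic lives in uriel.detect.pennant._detect_for_ticker and is not reproduced here. The sketch assumes win_len is tried in ascending order, that the event anchor is end_idx, and that a window "qualifies" only if it starts at a fixed pennant start bar and ends before the consolidation breaks down.

```python
def toy_detect(n_bars, pennant_start, pennant_last, min_dur, max_dur):
    """Toy scan with the dedup rule described above: break on the first
    qualifying win_len at each end_idx, then skip ahead by min_dur bars."""
    events = []
    skip_until_idx = 0
    for end_idx in range(n_bars):
        if end_idx < skip_until_idx:
            continue
        for win_len in range(min_dur, max_dur + 1):
            start_idx = end_idx - win_len + 1
            # Hypothetical qualification test: the window must begin exactly
            # where the consolidation begins and end while it is still intact.
            if start_idx == pennant_start and end_idx <= pennant_last:
                events.append(end_idx)
                skip_until_idx = end_idx + min_dur
                break
    return events


# One underlying consolidation running from bar 20 to bar 40.
baseline_events = toy_detect(60, 20, 40, min_dur=5, max_dur=15)
variant_events = toy_detect(60, 20, 40, min_dur=10, max_dur=20)

shared = sorted(set(baseline_events) & set(variant_events))
print(baseline_events, variant_events, shared)
```

Under these assumptions the baseline fires at bars 24, 29 and 34 and the variant at bars 29 and 39. Only bar 29 is shared: bars 24 and 34 are baseline-only because the variant's larger skip steps over them, and bar 39 is variant-only because it needs a 20-bar window that the baseline's 15-bar maximum rejects.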

4. MFE distribution comparison

Stat (MFE %)	Baseline	Variant
Mean	13.92	14.03
Median	7.50	7.47
P25	2.67	2.63
P75	16.01	15.97
P90	30.87	30.54

Hit-rate at common MFE thresholds

Threshold	Baseline	Variant
≥ 5 %	61.5 %	62.0 %
≥ 10 %	40.5 %	40.1 %
≥ 15 %	27.0 %	27.0 %
≥ 20 %	18.9 %	18.7 %
≥ 30 %	10.5 %	10.5 %
≥ 50 %	4.3 %	4.4 %

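The MFE distributions are practically indistinguishable: every reported summary statistic differs by at most 0.33 percentage points (P90) and every hit-rate by at most 0.5 percentage points. No formal test statistic is reported here, so "indistinguishable" is a statement about effect size rather than a p-value.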

5. MAE distribution comparison

Stat (MAE %)	Baseline	Variant
Mean	−9.54	−9.68
Median	−6.59	−6.74
P25	−13.64	−14.05
P75	−2.23	−2.15
P10	−23.21	−23.48

Stop-loss-relevant loss rates

MAE worse than…	Baseline	Variant
−5 %	58.5 %	58.7 %
−7 %	47.9 %	48.7 %
−10 %	35.9 %	36.4 %
−15 %	21.7 %	22.4 %

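Variant patterns breach every stop-loss threshold slightly more often (differences of 0.2 – 0.8 percentage points), and the variant's mean, median, P25 and P10 MAE are all marginally deeper; only P75 is fractionally shallower (−2.15 vs −2.23). The effect is small but consistent in direction: the variant does not improve the downside profile.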
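For reference, the anchor-relative outcome metrics compared in sections 4 and 5 can be sketched as below. This is not the code of uriel.outcomes.profiler._profile_one_event. It assumes the extremes are taken from forward daily highs and lows (the source only specifies "percent vs anchor close" over 30 forward trading days), and it treats "worse than −7 %" as MAE ≤ −7. The function names are invented.

```python
def mfe_mae_pct(anchor_close, fwd_highs, fwd_lows, horizon=30):
    """Return (MFE %, MAE %, days to MFE peak) relative to the anchor close.

    fwd_highs / fwd_lows hold the bars AFTER the anchor, oldest first.
    Returns None when there is no forward data (event has no outcome).
    """
    highs = fwd_highs[:horizon]
    lows = fwd_lows[:horizon]
    if not highs or not lows:
        return None
    peak = max(highs)
    trough = min(lows)
    mfe = (peak - anchor_close) / anchor_close * 100.0
    mae = (trough - anchor_close) / anchor_close * 100.0
    days_to_mfe = highs.index(peak) + 1  # first bar that reaches the peak
    return mfe, mae, days_to_mfe


def hit_rate_pct(mfes, threshold):
    """Share of events (in %) whose MFE is at least threshold."""
    return 100.0 * sum(1 for m in mfes if m >= threshold) / len(mfes)


def loss_rate_pct(maes, threshold):
    """Share of events (in %) whose MAE is at or below threshold (e.g. -7)."""
    return 100.0 * sum(1 for m in maes if m <= threshold) / len(maes)


# Toy example: anchor close 100, three forward bars.
outcome = mfe_mae_pct(100.0, [102.0, 110.0, 108.0], [98.0, 95.0, 101.0])
print(outcome)
print(hit_rate_pct([4.0, 16.0, 22.0, 9.0], 15.0))
print(loss_rate_pct([-2.0, -8.0, -12.0, -6.0], -7.0))
```

Events with no forward bars return None here, which would be one way to end up with fewer outcome rows than detected events.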

6. Time-to-MFE-peak comparison

Stat (days to MFE)	Baseline	Variant
Mean	15.7	15.5
Median	16	16
P25	5	5
P75	26	26
P90	30	30

Bucket distribution

Days-to-peak	Baseline	Variant
1 – 5	25.4 %	25.7 %
6 – 10	12.8 %	13.3 %
11 – 15	11.0 %	10.7 %
16 – 20	10.8 %	11.4 %
21 – 30	39.9 %	38.8 %

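The two populations resolve at indistinguishable speeds. Both show the same U-shaped profile: about a quarter of events peak within 1–5 days, and the 21–30 bucket holds close to 40 %. Two caveats apply to the late mass. The 21–30 bucket spans ten days where the others span five, and P90 = 30 means at least a tenth of events record their peak on the final day of the window, so part of it reflects the 30-day cutoff rather than a natural resolution time. With that caveat, patterns that work tend to work fast or take the full window.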

7. Quality vs quantity tradeoff

Expectancy proxy = mean(MFE) × P(MFE ≥ 15 %).

Metric	Baseline	Variant
n (with outcomes)	15,528	5,154
mean MFE %	13.92	14.03
P(MFE ≥ 15 %)	27.0 %	27.0 %
Per-pattern proxy	3.76	3.79
Population-total proxy (per-pattern × n)	58,370	19,561

Interpretation. Per-pattern quality is essentially identical (+0.8 % on mean MFE, identical hit-rate at +15 %). But the variant detects only 33 % as many patterns, so the total expected-return contribution from the variant population is about one-third of the baseline's. The variant is fewer patterns at the same per-pattern quality, not "fewer but better." Endpoint returns at 5, 10, 20, 30 days tell the same story — means within 0.2 percentage points, medians within 0.2 percentage points, P75/P90 essentially overlapping.

The downside picture is marginally worse for the variant (0.2 – 0.8 pp higher loss rates at every threshold), suggesting the wider pennant window (up to 20 bars) admits some consolidations that decay rather than coil.
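The proxy arithmetic can be checked directly from the table. The sketch below uses the rounded table values as inputs, so its population totals land slightly below the published 58,370 and 19,561, which were presumably computed from unrounded means and hit-rates (an assumption; the source does not say). The function name is invented.

```python
def expectancy_proxy(mean_mfe_pct, hit_rate_pct):
    """Per-pattern proxy = mean(MFE) x P(MFE >= 15 %), with P as a fraction."""
    return mean_mfe_pct * hit_rate_pct / 100.0


# (n with outcomes, mean MFE %, P(MFE >= 15 %) in %) from the table above.
rows = {
    "Baseline": (15528, 13.92, 27.0),
    "Variant": (5154, 14.03, 27.0),
}

totals = {}
for name, (n, mean_mfe, hit_rate) in rows.items():
    per_pattern = expectancy_proxy(mean_mfe, hit_rate)
    totals[name] = per_pattern * n
    print(f"{name}: per-pattern {per_pattern:.2f}, population total {totals[name]:,.0f}")

ratio = totals["Variant"] / totals["Baseline"]
print(f"Variant / Baseline total: {ratio:.3f}")
```

The ratio of the two totals comes out at about one third, matching the count ratio, because the per-pattern proxy is nearly the same in both populations.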

8. Recommendation summary

The variant is preferable to the baseline only if the consumer of the detector wants a smaller candidate set for its own sake and is willing to accept marginally deeper drawdowns to get it. On the quality dimensions actually measured (forward MFE, MAE, time-to-peak, endpoint returns) the two populations are equivalent for practical purposes; the variant produces no measurable lift in any of them, so the smaller set is not a cleaner one. The decision is therefore not a "better detector" question but a "right population size" question: fewer-but-equivalent patterns vs more-but-equivalent patterns. El Don decides.

Artifacts preserved under ab_test/: run_ab.py, analyze.py, baseline_events.parquet, variant_events.parquet, baseline_outcomes.parquet, variant_outcomes.parquet, summary.json, run.log.