The Brutal Truth About ML Trading: Why Your XGBoost Model Keeps Failing (And What Actually Works)

"We spent 6 months building the perfect XGBoost model. Walk-forward validation: 50% accuracy. Sharpe: 0.18. Then we discovered the real problem wasn't the model — it was the data we were feeding it."

The Journey Nobody Warns You About

If you're reading this, you've probably been there:

Downloaded SPY/NIFTY OHLCV data from Yahoo Finance
Engineered 26 technical indicators (SMA, RSI, MACD, Bollinger Bands, ATR...)
Trained an XGBoost classifier with walk-forward validation
Got 51% accuracy and called it "slightly better than a coin flip"
Tried intraday data (1-min, 5-min, 15-min bars)
Added quality filters, VWAP distance, volume spikes
Result: 0 trades, negative Sharpe, 50% accuracy

Welcome to the club. We did all of this. And we documented every failure so you don't have to repeat them.

What We Built (And Why It Failed)

Phase 1: The SMA + Volume + VWAP + XGBoost Dream

Component	What We Did	What We Learned
Data	yfinance daily OHLCV, 2015-2024	2,500 bars, single asset
Features	9 core: ret_1, ret_5, vol_20, rsi_14, atr_14	All derived from same 5 columns
Validation	Single 80/20 split	Leakage risk — future data bleeding in
Target	Binary: next day up/down	Classification = wrong framing
Metric	Accuracy only	51% = statistically meaningless
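A rough way to see why 51% says so little is a one-sided binomial test against a coin flip. The sketch below assumes a hypothetical hold-out of 500 bars (20% of the roughly 2,500 daily bars in the table); the function name and the sample size are illustrative and not taken from our experiment logs.

```python
from math import comb, sqrt

def binomial_p_value(n_correct: int, n_total: int, p_null: float = 0.5) -> float:
    """One-sided P(X >= n_correct) for X ~ Binomial(n_total, p_null)."""
    return sum(
        comb(n_total, k) * p_null**k * (1 - p_null) ** (n_total - k)
        for k in range(n_correct, n_total + 1)
    )

n_total = 500      # assumed: 20% hold-out of ~2,500 daily bars
n_correct = 255    # 51% accuracy on that hold-out
p_value = binomial_p_value(n_correct, n_total)
std_error = sqrt(0.25 / n_total)   # standard error of accuracy under the null

print(f"p-value: {p_value:.3f}, standard error of accuracy: {std_error:.3f}")
```

With a standard error of about 2.2 percentage points, a 1-point edge over 50% sits well inside one standard error, so it cannot be told apart from luck on a sample this size.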

The Reality Check: Walk-forward validation (5 folds, time-aware) dropped accuracy to 50.2%. Coin flip. Feature importance was flat across all features — no dominant signal.
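For readers who have not set up time-aware validation before, here is a minimal sketch of an expanding-window walk-forward splitter. The function `walk_forward_splits` and its `gap` parameter are our own illustration (not a library API), and the 5-bar gap is an assumed value chosen only to show how bars between train and test can be dropped.

```python
from typing import Iterator, Tuple
import numpy as np

def walk_forward_splits(
    n_samples: int, n_folds: int = 5, gap: int = 0
) -> Iterator[Tuple[np.ndarray, np.ndarray]]:
    """Yield (train_idx, test_idx) pairs with an expanding training window.

    Every test block lies strictly after its training data. `gap` bars
    at the start of each test block are skipped to limit leakage from
    labels that overlap the end of the training window.
    """
    fold_size = n_samples // (n_folds + 1)
    for fold in range(1, n_folds + 1):
        train_end = fold * fold_size
        test_start = train_end + gap
        test_end = min(train_end + fold_size, n_samples)
        if test_start >= test_end:
            continue  # gap swallowed the whole test block
        yield np.arange(0, train_end), np.arange(test_start, test_end)

splits = list(walk_forward_splits(n_samples=2500, n_folds=5, gap=5))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}  test {test_idx[0]}..{test_idx[-1]}")
```

The model is refit on each training window and scored only on the block that follows it, which is why the number it reports is usually lower, and more honest, than a single shuffled split.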

Phase 2: Feature Explosion (26 Features)

We expanded to 26 research-backed indicators:

features = ["sma_cross", "vol_ratio", "vwap_dist", "adx_14", "ret_1", "rsi_14", "macd", "bb_position", "obv", "cci", ...]

Result: Same 50.2% accuracy. Sharpe 0.18. Flat SHAP importance across all 26.

Why: All 26 features were derived from the same 5 OHLCV columns, and the correlations between them ran 0.8 or higher. These were not 26 signals but 26 formulas for the same signal.
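The redundancy is easy to reproduce. The sketch below builds a handful of close-derived features on a synthetic random walk (not our SPY data) and lists every pair whose absolute correlation exceeds 0.8. The feature definitions are simplified stand-ins for the indicators named above.

```python
import numpy as np
import pandas as pd

# Synthetic price path: a geometric random walk, used only for illustration
rng = np.random.default_rng(0)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 1500))))

sma_20 = close.rolling(20).mean()
features = pd.DataFrame({
    "sma_10_dist": close / close.rolling(10).mean() - 1,
    "sma_20_dist": close / sma_20 - 1,
    "ema_20_dist": close / close.ewm(span=20).mean() - 1,
    "ret_10": close.pct_change(10),
    "bb_position": (close - sma_20) / (2 * close.rolling(20).std()),
}).dropna()

corr = features.corr().abs()
cols = list(corr.columns)

# Collect each pair once (upper triangle) when |corr| is above the threshold
high_pairs = [
    (a, b, round(float(corr.loc[a, b]), 2))
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if corr.loc[a, b] > 0.8
]
for a, b, r in high_pairs:
    print(f"{a} vs {b}: |r| = {r}")
```

Five differently named columns, one underlying price series: most of the pairs move together because they are smoothed views of the same recent returns.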

Phase 3: The Quality Filter Trap

We filtered aggressively to only high-quality setups:

df["quality"] = (df["sma_cross"] == 1) & (df["vol_ratio"] > 1.2) & (df["vwap_dist"] > -0.05) & (df["adx_14"] > 15)

Result: 9.2% coverage. 0 trades. Negative Sharpe.

The Lesson: A quality filter can't create signal where none exists. It just filters your data into oblivion.

Phase 4: The Intraday Pivot

Timeframe	Bars	Accuracy	Sharpe
1-min	49K	49%	-69
5-min	11K	49%	-33
15-min	4K	50%	-5

The Lesson: More bars do not mean more signal. Intraday noise > daily noise. Same ceiling every time.

The Kaggle Wake-Up Call

While we were struggling with 5 OHLCV features, the DRW Crypto Market Prediction competition was running on Kaggle. The 1st place solution used 780 anonymized features and achieved a leaderboard (LB) score of 0.131.

What the DRW Winner Did (That We Didn't)

780 features
→ Correlation clustering (threshold=0.6) → ~60 medoids
→ Target correlation filter (|r| > 1e-4) → ~40
→ XGB + SHAP (6-fold CV) → top 20 per fold → ~30
→ Linear combinations + recycled dropped features → ~30+ synthetic
→ AutoEncoder → 8 deep features (major boost)
→ MLP (3-layer) + XGB ensemble = 0.131 LB
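The first stage of that pipeline, correlation clustering down to medoids, can be sketched in a few lines. This is a simplified greedy version written for illustration; it is an assumption about the general idea, not the winning team's code, and `cluster_medoids` is a hypothetical helper name. The data is synthetic.

```python
import numpy as np
import pandas as pd

def cluster_medoids(X: pd.DataFrame, threshold: float = 0.6) -> list:
    """Greedily group columns whose |corr| with a seed column is at least
    `threshold`, then keep one medoid column per group."""
    corr = X.corr().abs()
    unassigned = list(X.columns)
    medoids = []
    while unassigned:
        seed = unassigned[0]
        members = [seed] + [
            c for c in unassigned[1:] if corr.loc[seed, c] >= threshold
        ]
        # Medoid: the member most correlated, on average, with its own group
        medoid = corr.loc[members, members].mean(axis=1).idxmax()
        medoids.append(medoid)
        unassigned = [c for c in unassigned if c not in members]
    return medoids

# Synthetic demo: 3 hidden drivers, each observed through 4 noisy columns
rng = np.random.default_rng(0)
latent = rng.normal(size=(1000, 3))
X = pd.DataFrame({
    f"f{group}_{i}": latent[:, group] + 0.3 * rng.normal(size=1000)
    for group in range(3)
    for i in range(4)
})
selected = cluster_medoids(X, threshold=0.6)
print(selected)  # one representative column per hidden driver
```

Twelve columns collapse to three representatives, which is the same kind of compression the 780 → ~60 step describes.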

Aspect	DRW 1st Place	Our Attempt
Raw features	780 anonymized microstructure	5 OHLCV
Feature selection	Clustering + SHAP + mRMR + LOFO	None
Final features	8 deep (AutoEncoder compressed)	5-9 raw
CV	Purged group time series, 6-fold, 2-month gap	Simple walk-forward
Model	MLP + XGB ensemble	XGB only
Loss	Custom: 0.6×MSE + 0.4×Pearson	Default logloss
Target	Regression (price movement magnitude)	Binary up/down
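The custom loss in the table blends an error term with a correlation term. How the Pearson term enters is not spelled out in our notes, so the sketch below assumes it is used as (1 - r), which makes a perfectly correlated prediction contribute zero. The function name and default weights are ours for illustration.

```python
import numpy as np

def blended_loss(y_true, y_pred, w_mse: float = 0.6, w_corr: float = 0.4) -> float:
    """w_mse * MSE + w_corr * (1 - Pearson r). Lower is better.

    Assumption: the Pearson term is expressed as (1 - r).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)

    # Pearson correlation computed by hand so a constant input cannot divide by zero
    yt = y_true - y_true.mean()
    yp = y_pred - y_pred.mean()
    denom = np.sqrt((yt ** 2).sum() * (yp ** 2).sum())
    pearson = (yt * yp).sum() / denom if denom > 0 else 0.0

    return float(w_mse * mse + w_corr * (1.0 - pearson))

y = np.array([1.0, 2.0, 3.0])
perfect = blended_loss(y, y)        # zero error, r = 1
scaled = blended_loss(y, 2 * y)     # right direction, wrong magnitude
flipped = blended_loss(y, -y)       # wrong direction
print(perfect, scaled, flipped)
```

The point of the blend is that a prediction with the right ordering but the wrong scale is penalized only by the MSE term, while a prediction that points the wrong way is penalized by both.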

The 11th Place Surprise

11th place used just 29 features + 12 LinearRegression models and scored 0.111.

The Lesson: Feature source matters more than model complexity. 29 raw cross-sectional features fed to plain linear models scored 0.111, while our 26 engineered OHLCV features never got past a coin flip.

The Academic Validation

Paper 1 — Walk-Forward Validation Study (2025)

Daily OHLCV signals on 100 US equities. Result: 0.55% annualized return, Sharpe 0.33, p-value 0.34. Statistically insignificant. Signals only work during high-volatility regimes.

Paper 2 — S&P 500 Deep Learning (2023)

Claims strong results — but uses 22 stocks plus 9 fundamental ratios (P/B, EPS, P/E). Pure technical alone performs worse.

Paper 3 — Chinese A-Shares XGBoost

Sharpe 3.113, but the Chinese market is less efficient, the universe is 518 stocks, and the features are a fundamental-plus-technical hybrid. Not reproducible on SPY.

The Pattern: Every "successful" single-asset study either relies on in-sample optimization, adds fundamental data, or is not reproducible.

Why Single-Ticker SPY Is a Dead End

1. Efficient Market Hypothesis

SPY is the most analyzed ETF in the world. Any OHLCV pattern is arbitraged away almost instantly by HFT firms running the same indicators faster, with tick data and factor models.

2. No Relative Anchor

Pairs trading (SPY-VIX, SPY-QQQ) has a mean-reverting spread, which acts as a statistical equilibrium. A single ticker has no such equilibrium to revert to. It is a random walk with drift.

3. Multicollinearity

All 26 indicators are derived from the same five columns: Open, High, Low, Close, Volume. Same information, different math. Adding more indicators does not add new information; it adds noise.

4. The VWAP Lie

A common mistake is calculating VWAP as a cumulative average since day 1. Real VWAP is built from intraday trades: price times volume, summed over the session and divided by total session volume, with the running totals reset at each open. That is a completely different signal, and it cannot be reproduced from daily OHLCV alone.
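A minimal sketch of a VWAP that resets every session, assuming intraday bars with a DatetimeIndex and `high`, `low`, `close`, `volume` columns. Using the bar's typical price as a stand-in for individual trade prices is an approximation, and `session_vwap` is a hypothetical helper name. The bars are made-up numbers.

```python
import pandas as pd

def session_vwap(bars: pd.DataFrame) -> pd.Series:
    """Running VWAP that restarts at the first bar of each calendar day."""
    typical = (bars["high"] + bars["low"] + bars["close"]) / 3
    day = bars.index.normalize()  # midnight timestamp used as the session key
    cum_pv = (typical * bars["volume"]).groupby(day).cumsum()
    cum_vol = bars["volume"].groupby(day).cumsum()
    return cum_pv / cum_vol

idx = pd.to_datetime([
    "2024-01-02 09:30", "2024-01-02 09:31", "2024-01-02 09:32",
    "2024-01-03 09:30", "2024-01-03 09:31", "2024-01-03 09:32",
])
bars = pd.DataFrame({
    "high":   [101.0, 102.0, 103.0, 111.0, 112.0, 113.0],
    "low":    [99.0, 100.0, 101.0, 109.0, 110.0, 111.0],
    "close":  [100.0, 101.0, 102.0, 110.0, 111.0, 112.0],
    "volume": [1000, 2000, 1000, 500, 500, 1000],
}, index=idx)

bars["vwap"] = session_vwap(bars)
print(bars["vwap"])
```

The first bar of the second day starts again from its own typical price (110.0) instead of being dragged toward the previous day's levels, which is exactly what a cumulative-since-day-1 average gets wrong.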

What Actually Works (Per Player Type)

Player	Method	Data	Edge
HFT (Jane Street, Virtu)	Order book imbalance, microstructure	Tick/L2 data	Latency + signal
Quant Funds (Citadel, Two Sigma)	Cross-asset pairs, factor models	Multi-asset + alt data	Statistical arb
Retail	Trend following, VWAP pullback	OHLCV + discipline	Risk management
ML Engineer	Regime detection, vol forecasting	OHLCV + VIX + cross-asset	Directional consistency

The Retail Path Forward

1. Cross-Asset Pairs (Real Alpha Source)

Instead of SPY alone, use relationships between assets:

SPY-VIX: Correlation approximately -0.75, mean-reverting spread
SPY-QQQ: Beta approximately 0.95, residual = alpha source
NIFTY-BankNifty: Cointegrated pair (India)
Gold-Silver, Oil-Gas: Commodity pairs with physical anchors

2. Feature Engineering for Cross-Asset

Spread z-score — a mean-reverting signal

spread = spy_close - beta * qqq_close
spread_z = (spread - spread.rolling(252).mean()) / spread.rolling(252).std()

Ratio deviation from historical norm

ratio = spy_close / vix_close
ratio_dev = (ratio - ratio.rolling(63).mean()) / ratio.rolling(63).std()
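Both snippets above are the same operation, a rolling z-score, and the spread version leaves `beta` undefined. Here is a self-contained sketch on synthetic series (stand-ins, not real SPY/QQQ prices) that estimates the hedge ratio as an OLS slope. Fitting beta on the full sample is a simplification for the demo; in a backtest it should come from past data only.

```python
import numpy as np
import pandas as pd

def rolling_zscore(series: pd.Series, window: int) -> pd.Series:
    """Z-score of a series against its own trailing mean and std."""
    mean = series.rolling(window).mean()
    std = series.rolling(window).std()
    return (series - mean) / std

# Synthetic pair: spy follows qqq with a true hedge ratio of 1.2 plus noise
rng = np.random.default_rng(1)
n = 1000
qqq_close = pd.Series(300 + np.cumsum(rng.normal(0, 1.0, n)))
spy_close = 50 + 1.2 * qqq_close + pd.Series(rng.normal(0, 2.0, n))

# Assumed estimator: slope of a degree-1 least-squares fit of spy on qqq
beta = np.polyfit(qqq_close, spy_close, 1)[0]

spread = spy_close - beta * qqq_close
spread_z = rolling_zscore(spread, window=252)

print(f"estimated beta: {beta:.3f}")
print(spread_z.dropna().describe())
```

The first 251 values of `spread_z` are NaN because the 252-bar window is not yet full, so any model consuming this feature needs to drop or handle that warm-up period.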

3. Regime Detection Instead of Quality Filters

Separate trained models per market regime

regime = "high_vol" if vix > 25 else "low_vol"
model = models[regime]
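To make the routing idea concrete, here is a small sketch that fits one linear model per regime and picks the model by the current VIX level. The helper names, the least-squares models, and the synthetic data (where the feature's effect deliberately flips sign between regimes) are all assumptions for illustration. Only the VIX > 25 cutoff comes from the snippet above.

```python
import numpy as np

def label_regime(vix: np.ndarray, threshold: float = 25.0) -> np.ndarray:
    """Tag each bar as 'high_vol' or 'low_vol' from its VIX level."""
    return np.where(vix > threshold, "high_vol", "low_vol")

def fit_per_regime(X: np.ndarray, y: np.ndarray, regimes: np.ndarray) -> dict:
    """Fit one least-squares linear model (with intercept) per regime."""
    models = {}
    for regime in np.unique(regimes):
        mask = regimes == regime
        X_aug = np.column_stack([X[mask], np.ones(mask.sum())])
        coef, *_ = np.linalg.lstsq(X_aug, y[mask], rcond=None)
        models[str(regime)] = coef
    return models

def predict(models: dict, x_row: np.ndarray, vix_value: float,
            threshold: float = 25.0) -> float:
    """Route one observation to the model trained on its regime."""
    regime = "high_vol" if vix_value > threshold else "low_vol"
    coef = models[regime]
    return float(np.dot(coef[:-1], x_row) + coef[-1])

# Synthetic demo: slope is +1 in calm markets and -2 in stressed ones
rng = np.random.default_rng(2)
n = 400
vix = rng.uniform(10, 40, n)
X = rng.normal(size=(n, 1))
regimes = label_regime(vix)
y = np.where(regimes == "high_vol", -2.0, 1.0) * X[:, 0] + 0.1 * rng.normal(size=n)

models = fit_per_regime(X, y, regimes)
print({name: coef.round(2) for name, coef in models.items()})
```

A single model fit on all 400 rows would average the two slopes and learn almost nothing, which is the practical argument for splitting by regime instead of filtering rows away.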

4. Free Macro Data to Add

Data	What It Tells You	Source
VIX	Fear gauge, inverse SPY	Yahoo Finance (^VIX)
10Y-2Y Yield Spread	Recession predictor	FRED (free)
DXY	Dollar strength	Yahoo Finance
Sector ETFs (XLF, XLE, XLK)	Rotation signal	Yahoo Finance
India VIX	Vol regime for NIFTY	NSE India

The Lorentzian Classification Lesson

TradingView's popular Lorentzian Classification indicator uses k-NN with Lorentzian distance (not Euclidean). Its 15 features include 10 custom external sources — VIX, volume profile, macro data.
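For reference, the Lorentzian distance between two feature vectors is the sum of log(1 + |x_i - y_i|) over their components. The sketch below plugs it into a plain brute-force k-NN vote. It illustrates the distance only and is not a port of the TradingView script; the toy points and labels are made up.

```python
import math
from collections import Counter

def lorentzian_distance(a, b) -> float:
    """Sum of log(1 + |a_i - b_i|): large gaps are compressed by the log."""
    return sum(math.log(1.0 + abs(x - y)) for x, y in zip(a, b))

def knn_predict(train_X, train_y, query, k: int = 3):
    """Majority vote among the k training points closest to `query`."""
    ranked = sorted(
        zip(train_X, train_y),
        key=lambda pair: lorentzian_distance(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy data: label 0 clusters near (0, 0), label 1 near (1, 1)
train_X = [(0.1, 0.2), (0.0, 0.3), (0.2, 0.1), (1.0, 1.1), (0.9, 1.2), (1.1, 0.9)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, query=(0.15, 0.15)))
print(knn_predict(train_X, train_y, query=(1.0, 1.0)))
print(lorentzian_distance((0.0,), (100.0,)))  # log(101), far below the Euclidean 100
```

Because the log flattens large differences, a single outlier feature (a volatility spike, say) moves the distance far less than it would under Euclidean distance.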

Key insight: OHLCV is the canvas. External features are the paint. We were trying to paint a masterpiece with only one color.

Key Takeaways

"Single asset OHLCV = dead end. Cross-asset pairs = real alpha. More features or more bars won't fix a missing signal."

Feature engineering beats model choice — DRW's 8 AutoEncoder features outperformed our 26 indicators
Feature source beats feature count — 29 raw cross-sectional features beat 26 derived single-asset features
Cross-asset data is non-negotiable — SPY-VIX, SPY-QQQ, NIFTY-BankNifty
Walk-forward validation is honest — it shows no edge; that is the point
Intraday noise exceeds daily noise — more bars without more signal = worse results
OHLCV is context, not signal — use it for regime detection, not direction prediction
Simple models work if features are strong — LinearRegression + 29 features = 0.111 LB

Resources and References

DRW Crypto Market Prediction — 1st Place Solution (A_A, 2025)
DRW Crypto — 11th Place LinearRegression Solution (2025)
Walk-Forward Validation Study — Texas Tech (2025)
Lorentzian Classification Premium — jdehorty (TradingView)
Pairs Trading: Performance of a Relative-Value Arbitrage Rule — Gatev, Goetzmann, Rouwenhorst (Yale)

This article documents real experiments, real failures, and real learnings from building ML trading models. No cherry-picked results. No hindsight bias.

quizforml.com — Learn. Build. Fail. Learn Again.