The Brutal Truth About ML Trading: Why Your XGBoost Model Keeps Failing (And What Actually Works)
We built the "perfect" XGBoost trading model — walk-forward validation, 26 features, intraday data. Result: 50.2% accuracy, Sharpe 0.18, 0 trades. Every failure documented, and what the DRW Kaggle 1st place winner did instead.
"We spent 6 months building the perfect XGBoost model. Walk-forward validation: 50% accuracy. Sharpe: 0.18. Then we discovered the real problem wasn't the model — it was the data we were feeding it."
The Journey Nobody Warns You About
If you're reading this, you've probably been there:
- Downloaded SPY/NIFTY OHLCV data from Yahoo Finance
- Engineered 26 technical indicators (SMA, RSI, MACD, Bollinger Bands, ATR...)
- Trained an XGBoost classifier with walk-forward validation
- Got 51% accuracy and called it "slightly better than a coin flip"
- Tried intraday data (1-min, 5-min, 15-min bars)
- Added quality filters, VWAP distance, volume spikes
- Result: 0 trades, negative Sharpe, 50% accuracy
Welcome to the club. We did all of this. And we documented every failure so you don't have to repeat them.
What We Built (And Why It Failed)
Phase 1: The SMA + Volume + VWAP + XGBoost Dream
| Component | What We Did | What We Learned |
|---|---|---|
| Data | yfinance daily OHLCV, 2015-2024 | 2,500 bars, single asset |
| Features | 9 core, including ret_1, ret_5, vol_20, rsi_14, atr_14 | All derived from the same 5 OHLCV columns |
| Validation | Single 80/20 split | Leakage risk — future data bleeding in |
| Target | Binary: next day up/down | Classification = wrong framing |
| Metric | Accuracy only | 51% = statistically meaningless |
The Reality Check: Walk-forward validation (5 folds, time-aware) dropped accuracy to 50.2%. Coin flip. Feature importance was flat across all features — no dominant signal.
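To see why 51% on a single split says so little, a quick significance check helps. The sketch below uses a normal approximation to the binomial and assumes roughly 500 out-of-sample predictions (20% of about 2,500 daily bars); the function name `accuracy_p_value` is ours, purely for illustration.

```python
import math

def accuracy_p_value(accuracy: float, n_predictions: int) -> tuple[float, float]:
    """One-sided z-test of accuracy against a 50% coin flip (normal approximation)."""
    standard_error = math.sqrt(0.25 / n_predictions)
    z = (accuracy - 0.5) / standard_error
    # Upper-tail probability of the standard normal via the complementary error function.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Assumed sample size: 20% of ~2,500 daily bars, about 500 test predictions.
z, p = accuracy_p_value(0.51, 500)
print(f"z = {z:.2f}, one-sided p = {p:.2f}")
```

Under these assumptions the result is z of about 0.45 and p of about 0.33, which is nowhere near conventional significance thresholds.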
Phase 2: Feature Explosion (26 Features)
We expanded to 26 research-backed indicators:
features = ["sma_cross", "vol_ratio", "vwap_dist", "adx_14", "ret_1", "rsi_14", "macd", "bb_position", "obv", "cci", ...]
Result: Same 50.2% accuracy. Sharpe 0.18. Flat SHAP importance across all 26.
Why: All 26 features were derived from the same 5 OHLCV columns, and pairwise correlations between them were 0.8 or higher. These were not 26 signals but 26 formulas for the same signal.
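One way to check for this kind of redundancy is to look at pairwise correlations before training anything. The following is a minimal sketch on a synthetic random-walk price series (an assumption standing in for real data), with a handful of illustrative indicators rather than our full feature set.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic random-walk closes stand in for real price data (assumption).
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 1000))))

sma_20 = close.rolling(20).mean()
features = pd.DataFrame({
    "sma_10_dist": close / close.rolling(10).mean() - 1,
    "sma_20_dist": close / sma_20 - 1,
    "ret_5": close.pct_change(5),
    "ret_10": close.pct_change(10),
    "bb_position": (close - sma_20) / (2 * close.rolling(20).std()),
}).dropna()

corr = features.corr().abs()
cols = list(corr.columns)
# List each feature pair once, keeping only pairs with |correlation| above 0.8.
high_pairs = [
    (a, b, round(float(corr.loc[a, b]), 2))
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if corr.loc[a, b] > 0.8
]
print(high_pairs)
```

Any pair that shows up in `high_pairs` is a candidate for dropping or merging, since the second feature adds little beyond the first.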
Phase 3: The Quality Filter Trap
We filtered aggressively to only high-quality setups:
df["quality"] = (df["sma_cross"] == 1) & (df["vol_ratio"] > 1.2) & (df["vwap_dist"] > -0.05) & (df["adx_14"] > 15)
Result: 9.2% coverage. 0 trades. Negative Sharpe.
The Lesson: A quality filter can't create signal where none exists. It just filters your data into oblivion.
Phase 4: The Intraday Pivot
| Timeframe | Bars | Accuracy | Sharpe | Trades |
|---|---|---|---|---|
| 1-min | 49K | 49% | -69 | 0 |
| 5-min | 11K | 49% | -33 | 0 |
| 15-min | 4K | 50% | -5 | 0 |
The Lesson: More bars does not equal more signal. Intraday noise > daily noise. Same ceiling every time.
The Kaggle Wake-Up Call
While we were struggling with 5 OHLCV features, the DRW Crypto Market Prediction competition was happening on Kaggle. The 1st place solution used 780 anonymized features and achieved 0.131 LB score.
What DRW Did (That We Didn't)
- 780 raw features
- Correlation clustering (threshold = 0.6) → ~60 medoids
- Target correlation filter (|r| > 1e-4) → ~40
- XGB + SHAP (6-fold CV), top 20 per fold → ~30
- Linear combinations + recycled dropped features → ~30+ synthetic
- AutoEncoder → 8 deep features (major boost)
- MLP (3-layer) + XGB ensemble → 0.131 LB
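The correlation-clustering step can be sketched as follows. This is a greedy approximation of the idea and not the winning solution's actual code, which we have not reproduced; the function name `correlation_medoids` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def correlation_medoids(X: np.ndarray, threshold: float = 0.6) -> list[int]:
    """Greedily group columns by |correlation| and return one medoid index per group."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    unassigned = list(range(X.shape[1]))
    medoids = []
    while unassigned:
        seed = unassigned[0]
        # Every still-unassigned column correlated with the seed joins its cluster.
        members = [j for j in unassigned if corr[seed, j] >= threshold]
        # Medoid: the member with the highest mean |correlation| to its cluster.
        sub = corr[np.ix_(members, members)]
        medoids.append(members[int(np.argmax(sub.mean(axis=1)))])
        unassigned = [j for j in unassigned if j not in members]
    return medoids

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
# 12 columns: four noisy copies of each of 3 latent factors (synthetic stand-in).
X = np.repeat(latent, 4, axis=1) + 0.3 * rng.normal(size=(500, 12))
medoids = correlation_medoids(X)
print(len(medoids), medoids)
```

On this synthetic input the 12 columns collapse to 3 medoids, one per latent factor. A hierarchical clustering on the correlation matrix would be a reasonable alternative to the greedy pass.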
| Aspect | DRW 1st Place | Our Attempt |
|---|---|---|
| Raw features | 780 anonymized microstructure | 5 OHLCV |
| Feature selection | Clustering + SHAP + mRMR + LOFO | None |
| Final features | 8 deep (AutoEncoder compressed) | 5-9 raw |
| CV | Purged group time series, 6-fold, 2-month gap | Simple walk-forward |
| Model | MLP + XGB ensemble | XGB only |
| Loss | Custom: 0.6×MSE + 0.4×Pearson | Default logloss |
| Target | Regression (price movement magnitude) | Binary up/down |
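The blended loss in the table can be written down in a few lines. The sketch below assumes the Pearson term enters as (1 - r), so that higher correlation lowers the loss; the exact form used in the competition solution is not something we have confirmed, and `blended_loss` is our own name.

```python
import numpy as np

def blended_loss(y_true, y_pred, mse_weight: float = 0.6, corr_weight: float = 0.4) -> float:
    """0.6 x MSE + 0.4 x (1 - Pearson r); the (1 - r) form is an assumption."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)
    r = np.corrcoef(y_true, y_pred)[0, 1]
    return float(mse_weight * mse + corr_weight * (1.0 - r))

y = np.array([1.0, 2.0, 3.0, 4.0])
print(blended_loss(y, y))        # perfect prediction: loss near 0
print(blended_loss(y, y[::-1]))  # reversed prediction: 0.6 * 5 + 0.4 * 2 = 3.8
```

This version only evaluates the loss. Using it as a training objective in XGBoost or an MLP would also require gradients, which a deep learning framework provides automatically and XGBoost expects you to supply by hand.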
The 11th Place Surprise
11th place used just 29 features + 12 LinearRegression models and scored 0.111.
The Lesson: Feature source matters more than model complexity. A set of linear models on 29 raw cross-sectional features placed 11th, while our XGBoost on 26 engineered OHLCV features never beat a coin flip.
The Academic Validation
Paper 1 — Walk-Forward Validation Study (2025)
Daily OHLCV signals on 100 US equities. Result: 0.55% annualized return, Sharpe 0.33, p-value 0.34. Statistically insignificant. Signals only work during high volatility regimes.
Paper 2 — S&P 500 Deep Learning (2023)
Claims strong results — but uses 22 stocks plus 9 fundamental ratios (P/B, EPS, P/E). Pure technical alone performs worse.
Paper 3 — Chinese A-Shares XGBoost
Sharpe 3.113 — but China market is less efficient, 518 stocks, fundamental plus technical hybrid. Not reproducible on SPY.
The Pattern: Every "successful" study we found relies on in-sample optimization, adds fundamental data, or cannot be reproduced on a single liquid asset like SPY.
Why Single-Ticker SPY Is a Dead End
1. Efficient Market Hypothesis
SPY is the most analyzed ETF in the world. Any OHLCV pattern is instantly arbitraged by HFT firms running the same indicators faster, with tick data and factor models.
2. No Relative Anchor
Pairs trading (SPY-VIX, SPY-QQQ) has a mean-reverting spread — a statistical equilibrium. Single-ticker has no equilibrium to revert to. It is a random walk with drift.
3. Multicollinearity
All 26 indicators are derived from Open, High, Low, Close, and Volume. Same information, different math. Adding more indicators does not add new information; it adds noise.
4. The VWAP Lie
A common mistake is calculating VWAP as a cumulative average since day 1 of the dataset. Real VWAP is volume-weighted across the trades within a session or bar and is typically reset each session, which makes it a completely different signal that cannot be reproduced from daily OHLCV alone.
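The difference is easy to see on a toy example. The sketch below uses six synthetic 1-minute bars over two sessions and treats each bar's price as a proxy for its trades (an assumption; a true VWAP would use trade-level prices and sizes).

```python
import pandas as pd

# Synthetic 1-minute bars over two sessions (illustrative values only).
idx = pd.date_range("2024-01-02 09:30", periods=3, freq="min").append(
    pd.date_range("2024-01-03 09:30", periods=3, freq="min")
)
bars = pd.DataFrame(
    {"price": [100.0, 101.0, 102.0, 110.0, 111.0, 112.0],
     "volume": [10, 20, 10, 10, 10, 20]},
    index=idx,
)
bars["pv"] = bars["price"] * bars["volume"]

# Session-anchored VWAP: cumulative sums restart at each calendar day.
by_session = bars.groupby(bars.index.normalize())
bars["vwap_session"] = by_session["pv"].cumsum() / by_session["volume"].cumsum()

# The mistaken version: one cumulative average since the first bar of the dataset.
bars["vwap_cumulative"] = bars["pv"].cumsum() / bars["volume"].cumsum()

print(bars[["price", "vwap_session", "vwap_cumulative"]])
```

At the first bar of the second session the anchored VWAP is 110.0, while the never-reset version reads 102.8 because it is still dragged down by the previous day's prices.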
What Actually Works (Per Player Type)
| Player | Method | Data | Edge |
|---|---|---|---|
| HFT (Jane Street, Virtu) | Order book imbalance, microstructure | Tick/L2 data | Latency + signal |
| Quant Funds (Citadel, Two Sigma) | Cross-asset pairs, factor models | Multi-asset + alt data | Statistical arb |
| Retail | Trend following, VWAP pullback | OHLCV + discipline | Risk management |
| ML Engineer | Regime detection, vol forecasting | OHLCV + VIX + cross-asset | Directional consistency |
The Retail Path Forward
1. Cross-Asset Pairs (Real Alpha Source)
Instead of SPY alone, use relationships between assets:
- SPY-VIX: Correlation approximately -0.75, mean-reverting spread
- SPY-QQQ: Beta approximately 0.95, residual = alpha source
- NIFTY-BankNifty: Co-integrated pair (India)
- Gold-Silver, Oil-Gas: Commodity pairs with physical anchors
2. Feature Engineering for Cross-Asset
Spread z-score — a mean-reverting signal
spread = spy_close - beta * qqq_close
spread_z = (spread - spread.rolling(252).mean()) / spread.rolling(252).std()
Ratio deviation from historical norm
ratio = spy_close / vix_close
ratio_dev = (ratio - ratio.rolling(63).mean()) / ratio.rolling(63).std()
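For completeness, here is a self-contained version of the spread z-score that also shows one way to obtain `beta`. It runs on synthetic prices (an assumption for illustration) and estimates the hedge ratio with a simple least-squares fit; in a real backtest the fit should use only a training window to avoid lookahead.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 600
qqq_close = pd.Series(300 + np.cumsum(rng.normal(0, 1, n)))
# Synthetic SPY = 1.2 * QQQ + constant + stationary noise (illustration only).
spy_close = 1.2 * qqq_close + 50 + rng.normal(0, 2, n)

# Hedge ratio from a degree-1 least-squares fit: polyfit returns (slope, intercept).
# Fitting on the full sample leaks future data; use a training window in practice.
beta, intercept = np.polyfit(qqq_close, spy_close, 1)

spread = spy_close - beta * qqq_close
window = 252
spread_z = (spread - spread.rolling(window).mean()) / spread.rolling(window).std()

print(f"beta = {beta:.3f}")
print(spread_z.dropna().describe())
```

The first `window - 1` values of `spread_z` are NaN because the rolling statistics need a full window before they are defined.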
3. Regime Detection Instead of Quality Filters
Separate trained models per market regime
regime = "high_vol" if vix > 25 else "low_vol"
model = models[regime]
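A slightly fuller sketch of regime routing is shown below. It fits one model per regime and dispatches each prediction to the matching model. The least-squares linear model is only a placeholder for whatever estimator you prefer, the data is synthetic, and the helper names (`label_regime`, `fit_per_regime`, `predict`) are hypothetical.

```python
import numpy as np

VIX_THRESHOLD = 25  # same threshold as in the snippet above

def label_regime(vix: np.ndarray) -> np.ndarray:
    return np.where(vix > VIX_THRESHOLD, "high_vol", "low_vol")

def fit_per_regime(X: np.ndarray, y: np.ndarray, regimes: np.ndarray) -> dict:
    """Fit one least-squares linear model (weights plus intercept) per regime."""
    models = {}
    for regime in np.unique(regimes):
        mask = regimes == regime
        A = np.column_stack([X[mask], np.ones(mask.sum())])
        coef, *_ = np.linalg.lstsq(A, y[mask], rcond=None)
        models[str(regime)] = coef
    return models

def predict(models: dict, X: np.ndarray, regimes: np.ndarray) -> np.ndarray:
    A = np.column_stack([X, np.ones(len(X))])
    # Route each row to the model trained on its own regime.
    return np.array([A[i] @ models[str(r)] for i, r in enumerate(regimes)])

# Synthetic data where the feature's effect flips sign between regimes.
rng = np.random.default_rng(2)
vix = rng.uniform(10, 40, 400)
X = rng.normal(size=(400, 1))
regimes = label_regime(vix)
y = np.where(regimes == "high_vol", -1.0, 2.0) * X[:, 0] + rng.normal(0, 0.1, 400)

models = fit_per_regime(X, y, regimes)
preds = predict(models, X, regimes)
print({name: coef.round(2) for name, coef in models.items()})
```

A single pooled model would average the two opposite effects toward something weak, which is the motivation for splitting by regime rather than filtering rows away.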
4. Free Macro Data to Add
| Data | What It Tells You | Source |
|---|---|---|
| VIX | Fear gauge, inverse SPY | Yahoo Finance (^VIX) |
| 10Y-2Y Yield Spread | Recession predictor | FRED (free) |
| DXY | Dollar strength | Yahoo Finance |
| Sector ETFs (XLF, XLE, XLK) | Rotation signal | Yahoo Finance |
| India VIX | Vol regime for NIFTY | NSE India |
The Lorentzian Classification Lesson
TradingView's popular Lorentzian Classification indicator uses k-NN with Lorentzian distance (not Euclidean). Its 15 features include 10 custom external sources — VIX, volume profile, macro data.
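The core idea, k-NN with a Lorentzian distance, fits in a few lines. The sketch below uses the distance d(a, b) = Σ log(1 + |a_i - b_i|) and a plain majority vote over labels in {-1, +1}. It is not a port of the TradingView script, which has further details we do not reproduce, and the toy data and `k` value are arbitrary.

```python
import numpy as np

def lorentzian_distance(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Sum of log(1 + |difference|) over features; dampens the effect of outliers."""
    return np.sum(np.log1p(np.abs(a - b)), axis=-1)

def knn_predict(X_train: np.ndarray, y_train: np.ndarray, x_query: np.ndarray, k: int = 8) -> int:
    """Vote among the k nearest neighbours; labels are -1 or +1, and a tie returns 0."""
    distances = lorentzian_distance(X_train, x_query)
    nearest = np.argsort(distances)[:k]
    return int(np.sign(np.sum(y_train[nearest])))

# Toy training set: two well-separated groups with opposite labels.
X_train = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
y_train = np.array([-1, -1, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.05, 0.0]), k=2))  # -1
print(knn_predict(X_train, y_train, np.array([5.0, 5.05]), k=2))  # 1
```

Because the logarithm compresses large differences, a single feature with an extreme move contributes less to the distance than it would under a Euclidean metric.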
Key insight: OHLCV is the canvas. External features are the paint. We were trying to paint a masterpiece with only one color.
Key Takeaways
"Single asset OHLCV = dead end. Cross-asset pairs = real alpha. More features or more bars won't fix a missing signal."
- Feature engineering beats model choice — DRW's 8 AutoEncoder features outperformed our 26 indicators
- Feature source beats feature count — 29 raw cross-sectional features beat 26 derived single-asset features
- Cross-asset data is non-negotiable — SPY-VIX, SPY-QQQ, NIFTY-BankNifty
- Walk-forward validation is honest — it shows no edge; that is the point
- Intraday noise exceeds daily noise — more bars without more signal = worse results
- OHLCV is context, not signal — use it for regime detection, not direction prediction
- Simple models work if features are strong — LinearRegression + 29 features = 0.111 LB
Resources and References
- DRW Crypto Market Prediction — 1st Place Solution (A_A, 2025)
- DRW Crypto — 11th Place LinearRegression Solution (2025)
- Walk-Forward Validation Study — Texas Tech (2025)
- Lorentzian Classification Premium — jdehorty (TradingView)
- Pairs Trading: Performance of a Relative-Value Arbitrage Rule — Gatev, Goetzmann, Rouwenhorst (Yale)
This article documents real experiments, real failures, and real learnings from building ML trading models. No cherry-picked results. No hindsight bias.
quizforml.com — Learn. Build. Fail. Learn Again.