Machine Learning | Jun 5, 2026 | 18 min read

The Brutal Truth About ML Trading: Why Your XGBoost Model Keeps Failing (And What Actually Works)

We built the "perfect" XGBoost trading model — walk-forward validation, 26 features, intraday data. Result: 50.2% accuracy, Sharpe 0.18, 0 trades. Every failure documented, and what the DRW Kaggle 1st place winner did instead.

"We spent 6 months building the perfect XGBoost model. Walk-forward validation: 50% accuracy. Sharpe: 0.18. Then we discovered the real problem wasn't the model — it was the data we were feeding it."


The Journey Nobody Warns You About

If you're reading this, you've probably been there:

  • Downloaded SPY/NIFTY OHLCV data from Yahoo Finance
  • Engineered 26 technical indicators (SMA, RSI, MACD, Bollinger Bands, ATR...)
  • Trained an XGBoost classifier with walk-forward validation
  • Got 51% accuracy and called it "slightly better than a coin flip"
  • Tried intraday data (1-min, 5-min, 15-min bars)
  • Added quality filters, VWAP distance, volume spikes
  • Result: 0 trades, negative Sharpe, 50% accuracy

Welcome to the club. We did all of this. And we documented every failure so you don't have to repeat them.


What We Built (And Why It Failed)

Phase 1: The SMA + Volume + VWAP + XGBoost Dream

| Component | What We Did | What We Learned |
| --- | --- | --- |
| Data | yfinance daily OHLCV, 2015-2024 | 2,500 bars, single asset |
| Features | 9 core: ret_1, ret_5, vol_20, rsi_14, atr_14 | All derived from same 5 columns |
| Validation | Single 80/20 split | Leakage risk: future data bleeding in |
| Target | Binary: next day up/down | Classification = wrong framing |
| Metric | Accuracy only | 51% = statistically meaningless |
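
To make "statistically meaningless" concrete, here is a rough significance check. It is a sketch under stated assumptions: every prediction is treated as an independent coin flip, the test set is taken to be 500 predictions (20% of roughly 2,500 bars), and `accuracy_p_value` is a name we made up for illustration.

```python
import math

def accuracy_p_value(accuracy: float, n_predictions: int) -> float:
    """One-sided p-value that `accuracy` beats a fair coin (normal approximation)."""
    # Standard error of the hit rate under the null hypothesis p = 0.5.
    std_err = math.sqrt(0.25 / n_predictions)
    z = (accuracy - 0.5) / std_err
    # Upper tail of the standard normal distribution.
    return 0.5 * math.erfc(z / math.sqrt(2))

# Assumed test set: 20% of roughly 2,500 daily bars = 500 predictions.
p = accuracy_p_value(0.51, 500)
print(f"p-value for 51% on 500 predictions: {p:.2f}")
```

Under these assumptions the p-value comes out at roughly 0.33, far from any conventional significance threshold.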

The Reality Check: Walk-forward validation (5 folds, time-aware) dropped accuracy to 50.2%. Coin flip. Feature importance was flat across all features — no dominant signal.
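
For readers who have not set this up before, a minimal sketch of time-aware, expanding-window splits follows. The function name and the optional `gap` parameter are our own illustrative choices, not part of the original experiment code.

```python
from typing import Iterator, List, Tuple

def walk_forward_splits(
    n_samples: int, n_folds: int = 5, gap: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs where training data always precedes test data."""
    fold_size = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        test_start = k * fold_size
        # The last fold absorbs any leftover bars.
        test_end = n_samples if k == n_folds else test_start + fold_size
        # Expanding window: train on everything before the test block,
        # minus an optional gap that keeps overlapping labels out of training.
        train_idx = list(range(0, test_start - gap))
        test_idx = list(range(test_start, test_end))
        yield train_idx, test_idx

splits = list(walk_forward_splits(n_samples=2500, n_folds=5))
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}  test {test_idx[0]}..{test_idx[-1]}")
```

The key property is that no test index ever precedes a training index, which a shuffled 80/20 split does not guarantee.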

Phase 2: Feature Explosion (26 Features)

We expanded to 26 research-backed indicators:

features = ["sma_cross", "vol_ratio", "vwap_dist", "adx_14", "ret_1", "rsi_14", "macd", "bb_position", "obv", "cci", ...]

Result: Same 50.2% accuracy. Sharpe 0.18. Flat SHAP importance across all 26.

Why: All 26 features were derived from the same 5 OHLCV columns, and pairwise correlations between them ran 0.8 or higher. These were not 26 signals but 26 formulas for the same signal.
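
The effect is easy to reproduce. The sketch below uses a synthetic random walk (an assumption, not our actual data) and two illustrative features, distance from the 10-bar and 20-bar SMA, to show how strongly indicators built from the same close series move together.

```python
import math
import random

def sma_distance(prices, window):
    """Price minus its simple moving average, for each bar with a full window."""
    out = []
    for t in range(window - 1, len(prices)):
        out.append(prices[t] - sum(prices[t - window + 1 : t + 1]) / window)
    return out

def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Synthetic random-walk closes (an assumption; not the article's data).
rng = random.Random(0)
closes = [100.0]
for _ in range(2000):
    closes.append(closes[-1] + rng.gauss(0, 1))

d10 = sma_distance(closes, 10)
d20 = sma_distance(closes, 20)
# Align both series on the same bars (the 20-bar version starts 10 bars later).
corr = pearson(d10[10:], d20)
print(f"corr(dist to SMA10, dist to SMA20) = {corr:.2f}")
```

Two "different" indicators end up highly correlated even though the underlying series is pure noise, which is why adding more of them did not move our results.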

Phase 3: The Quality Filter Trap

We filtered aggressively to only high-quality setups:

df["quality"] = (df["sma_cross"] == 1) & (df["vol_ratio"] > 1.2) & (df["vwap_dist"] > -0.05) & (df["adx_14"] > 15)

Result: 9.2% coverage. 0 trades. Negative Sharpe.

The Lesson: A quality filter can't create signal where none exists. It just filters your data into oblivion.
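
A quick way to watch coverage collapse is to apply the conditions one at a time. The rows below are made up purely to show the mechanics, and `coverage` is a hypothetical helper, not code from our pipeline.

```python
def coverage(rows, conditions):
    """Fraction of rows that satisfy every condition at once."""
    passed = [row for row in rows if all(cond(row) for cond in conditions)]
    return len(passed) / len(rows)

# The four thresholds from the quality filter above.
conditions = [
    lambda r: r["sma_cross"] == 1,
    lambda r: r["vol_ratio"] > 1.2,
    lambda r: r["vwap_dist"] > -0.05,
    lambda r: r["adx_14"] > 15,
]

# Tiny made-up sample, only to show the mechanics.
rows = [
    {"sma_cross": 1, "vol_ratio": 1.5, "vwap_dist": 0.01, "adx_14": 22},
    {"sma_cross": 1, "vol_ratio": 0.9, "vwap_dist": 0.02, "adx_14": 30},
    {"sma_cross": 0, "vol_ratio": 1.8, "vwap_dist": 0.00, "adx_14": 18},
    {"sma_cross": 1, "vol_ratio": 1.3, "vwap_dist": -0.10, "adx_14": 25},
]

for i in range(1, len(conditions) + 1):
    print(f"first {i} condition(s): coverage = {coverage(rows, conditions[:i]):.0%}")
```

Each extra AND can only shrink the surviving set, so checking coverage per condition before training tells you early whether any trades will be left.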

Phase 4: The Intraday Pivot

| Timeframe | Bars | Accuracy | Sharpe | Trades |
| --- | --- | --- | --- | --- |
| 1-min | 49K | 49% | -69 | 0 |
| 5-min | 11K | 49% | -33 | 0 |
| 15-min | 4K | 50% | -5 | 0 |

The Lesson: More bars does not equal more signal. Intraday noise > daily noise. Same ceiling every time.


The Kaggle Wake-Up Call

While we were struggling with 5 OHLCV columns, the DRW Crypto Market Prediction competition was running on Kaggle. The 1st place solution started from 780 anonymized features and scored 0.131 on the leaderboard (LB).

What the DRW 1st Place Solution Did (That We Didn't)

  1. Start from 780 features
  2. Correlation clustering (threshold=0.6) → ~60 medoids
  3. Target correlation filter (|r| > 1e-4) → ~40
  4. XGB + SHAP (6-fold CV), top 20 per fold → ~30
  5. Linear combinations + recycled dropped features → ~30+ synthetic
  6. AutoEncoder → 8 deep features (major boost)
  7. MLP (3-layer) + XGB ensemble = 0.131 LB
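
The first stage, correlation clustering with medoids, can be sketched as follows. This is one plausible reading of that step, not the winner's code: we assume features are grouped transitively whenever |r| reaches the threshold, and that the medoid is the member with the highest average |r| to its own cluster. All names and data are illustrative.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cluster_medoids(features, threshold=0.6):
    """Group features whose |correlation| >= threshold and keep one medoid per group."""
    names = list(features)
    corr = {a: {b: abs(pearson(features[a], features[b])) for b in names} for a in names}

    clusters, unassigned = [], set(names)
    for seed in names:
        if seed not in unassigned:
            continue
        # Grow a connected component of transitively linked features.
        cluster, frontier = [], [seed]
        unassigned.discard(seed)
        while frontier:
            current = frontier.pop()
            cluster.append(current)
            linked = [m for m in names if m in unassigned and corr[current][m] >= threshold]
            for m in linked:
                unassigned.discard(m)
                frontier.append(m)
        clusters.append(cluster)

    # Medoid: the member most correlated, on average, with its own cluster.
    return [max(c, key=lambda a: sum(corr[a][b] for b in c)) for c in clusters]

features = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "f2": [1.1, 2.0, 2.9, 4.2, 5.0, 6.1],     # near-copy of f1
    "f3": [2.0, 4.1, 6.0, 8.0, 9.9, 12.0],    # near-copy of f1, rescaled
    "f4": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],  # unrelated oscillation
}
medoids = cluster_medoids(features)
print(medoids)
```

Here the three near-copies collapse into one representative and the unrelated feature survives on its own, which is the same kind of reduction as going from 780 features to about 60.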

| Aspect | DRW 1st Place | Our Attempt |
| --- | --- | --- |
| Raw features | 780 anonymized microstructure | 5 OHLCV |
| Feature selection | Clustering + SHAP + mRMR + LOFO | None |
| Final features | 8 deep (AutoEncoder compressed) | 5-9 raw |
| CV | Purged group time series, 6-fold, 2-month gap | Simple walk-forward |
| Model | MLP + XGB ensemble | XGB only |
| Loss | Custom: 0.6×MSE + 0.4×Pearson | Default logloss |
| Target | Regression (price movement magnitude) | Binary up/down |
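
The custom loss in the table can be sketched as below. How the Pearson term enters is our assumption: we use (1 - r) so that a higher correlation lowers the loss. The function name and sample values are illustrative only.

```python
import math

def blended_loss(y_true, y_pred, w_mse=0.6, w_corr=0.4):
    """0.6 * MSE + 0.4 * (1 - Pearson r); lower is better."""
    n = len(y_true)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in y_true))
    sp = math.sqrt(sum((p - mp) ** 2 for p in y_pred))
    # Constant predictions have no defined correlation; treat them as r = 0.
    r = cov / (st * sp) if st > 0 and sp > 0 else 0.0
    return w_mse * mse + w_corr * (1.0 - r)

y = [0.1, -0.2, 0.3, 0.0]
print(blended_loss(y, y))                # perfect forecast
print(blended_loss(y, [-v for v in y]))  # perfectly wrong direction
```

The correlation term rewards getting the ordering and direction of moves right even when magnitudes are off, which plain MSE or logloss does not.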

The 11th Place Surprise

11th place used just 29 features + 12 LinearRegression models and scored 0.111.

The Lesson: Feature source matters more than model complexity. 29 raw cross-sectional features with plain linear models scored 0.111, while our 26 engineered OHLCV features never got past a coin flip.


The Academic Validation

Paper 1 — Walk-Forward Validation Study (2025)

Daily OHLCV signals on 100 US equities. Result: 0.55% annualized return, Sharpe 0.33, p-value 0.34. Statistically insignificant. The signals only work in high-volatility regimes.

Paper 2 — S&P 500 Deep Learning (2023)

Claims strong results — but uses 22 stocks plus 9 fundamental ratios (P/B, EPS, P/E). Pure technical alone performs worse.

Paper 3 — Chinese A-Shares XGBoost

Sharpe 3.113, but the Chinese market is less efficient, and the study uses 518 stocks with a hybrid of fundamental and technical features. Not reproducible on SPY.

The Pattern: Every "successful" study relies on in-sample optimization, fundamental data, or a setup that does not reproduce on a single ticker like SPY.


Why Single-Ticker SPY Is a Dead End

1. Efficient Market Hypothesis

SPY is the most analyzed ETF in the world. Any OHLCV pattern is instantly arbitraged by HFT firms running the same indicators faster, with tick data and factor models.

2. No Relative Anchor

Pairs trading (SPY-VIX, SPY-QQQ) has a mean-reverting spread — a statistical equilibrium. Single-ticker has no equilibrium to revert to. It is a random walk with drift.

3. Multicollinearity

All 26 indicators are derived from Open, High, Low, Close, and Volume. Same information, different math. Adding more indicators does not add new information; it adds noise.

4. The VWAP Lie

A common mistake is calculating VWAP as a cumulative average since day 1. Real VWAP is built from intraday trades: each price is weighted by its volume, and the running total resets every session. That is a completely different signal, and it cannot be reproduced from daily OHLCV alone.
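
A minimal sketch of a session-reset VWAP from intraday bars is shown below. It assumes each bar carries a representative price (for example the typical price, (high + low + close) / 3, which is only an approximation of trade-level VWAP) and a positive volume. The bar data is made up.

```python
from collections import defaultdict

def session_vwap(bars):
    """Running VWAP per bar, reset at the start of each session (date).

    `bars` is a list of (date, price, volume) tuples in time order.
    """
    cum_pv = defaultdict(float)   # cumulative price * volume per session
    cum_vol = defaultdict(float)  # cumulative volume per session
    vwap = []
    for date, price, volume in bars:
        cum_pv[date] += price * volume
        cum_vol[date] += volume
        vwap.append(cum_pv[date] / cum_vol[date])
    return vwap

# Made-up minute bars across two sessions.
bars = [
    ("2024-01-02", 100.0, 1000),
    ("2024-01-02", 102.0, 3000),
    ("2024-01-03", 110.0, 500),
    ("2024-01-03", 108.0, 500),
]
print(session_vwap(bars))
```

Note how the second session starts fresh at 110.0 instead of inheriting the previous day's totals, which is exactly what the cumulative-since-day-1 version gets wrong.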


What Actually Works (Per Player Type)

| Player | Method | Data | Edge |
| --- | --- | --- | --- |
| HFT (Jane Street, Virtu) | Order book imbalance, microstructure | Tick/L2 data | Latency + signal |
| Quant Funds (Citadel, Two Sigma) | Cross-asset pairs, factor models | Multi-asset + alt data | Statistical arb |
| Retail | Trend following, VWAP pullback | OHLCV + discipline | Risk management |
| ML Engineer | Regime detection, vol forecasting | OHLCV + VIX + cross-asset | Directional consistency |

The Retail Path Forward

1. Cross-Asset Pairs (Real Alpha Source)

Instead of SPY alone, use relationships between assets:

  • SPY-VIX: Correlation approximately -0.75, mean-reverting spread
  • SPY-QQQ: Beta approximately 0.95, residual = alpha source
  • NIFTY-BankNifty: Co-integrated pair (India)
  • Gold-Silver, Oil-Gas: Commodity pairs with physical anchors
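
To make "beta" and "residual" concrete, here is a minimal least-squares sketch. The return series are made up for illustration and `ols_beta` is a hypothetical helper; in practice you would estimate beta on a rolling window of real returns.

```python
def ols_beta(x, y):
    """Slope and intercept of the least-squares line y = alpha + beta * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    beta = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return beta, my - beta * mx

# Made-up daily returns, for illustration only.
qqq_ret = [0.010, -0.020, 0.015, 0.005, -0.010]
spy_ret = [0.009, -0.018, 0.015, 0.004, -0.011]

beta, alpha = ols_beta(qqq_ret, spy_ret)
# The residual is the part of SPY's move that QQQ does not explain.
residual = [s - (alpha + beta * q) for q, s in zip(qqq_ret, spy_ret)]
print(f"beta = {beta:.3f}")
```

The residual series, not the raw price, is what a pairs model tries to forecast.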

2. Feature Engineering for Cross-Asset

# Spread z-score: a mean-reverting signal
spread = spy_close - beta * qqq_close
spread_z = (spread - spread.rolling(252).mean()) / spread.rolling(252).std()

# Ratio deviation from historical norm
ratio = spy_close / vix_close
ratio_dev = (ratio - ratio.rolling(63).mean()) / ratio.rolling(63).std()
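
Both features above are the same operation, a rolling z-score, applied to different series. A dependency-free sketch follows so the arithmetic is visible; the helper name and the toy input are our own, and the sample standard deviation is used to match the pandas default.

```python
import math

def rolling_zscore(values, window):
    """Z-score of each value against the trailing `window` values (inclusive).

    Returns None until a full window is available or when the window is flat.
    """
    out = []
    for t in range(len(values)):
        if t + 1 < window:
            out.append(None)
            continue
        chunk = values[t - window + 1 : t + 1]
        mean = sum(chunk) / window
        # Sample standard deviation (ddof=1), matching the pandas default.
        var = sum((v - mean) ** 2 for v in chunk) / (window - 1)
        out.append((values[t] - mean) / math.sqrt(var) if var > 0 else None)
    return out

toy_spread = [1.0, 2.0, 3.0, 2.0, 5.0]
print(rolling_zscore(toy_spread, window=3))
```

With a 252-bar window the first 251 values are undefined, so budget for that warm-up period when counting usable training rows.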

3. Regime Detection Instead of Quality Filters

# Separate trained models per market regime
regime = "high_vol" if vix > 25 else "low_vol"
model = models[regime]
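
A fuller sketch of the same idea, including where the `models` dictionary comes from, might look like this. `MeanModel` is a deliberately trivial, hypothetical stand-in for whatever estimator you actually train, and the rows are made up.

```python
from statistics import mean

class MeanModel:
    """Hypothetical stand-in for a real model: predicts the training mean."""

    def fit(self, targets):
        self.prediction = mean(targets)
        return self

    def predict(self):
        return self.prediction

def regime_of(vix, threshold=25):
    return "high_vol" if vix > threshold else "low_vol"

def fit_by_regime(rows, threshold=25):
    """Fit one model per regime; `rows` are (vix, next_day_return) pairs."""
    grouped = {"high_vol": [], "low_vol": []}
    for vix, target in rows:
        grouped[regime_of(vix, threshold)].append(target)
    # Skip regimes with no training rows rather than fitting on nothing.
    return {name: MeanModel().fit(t) for name, t in grouped.items() if t}

# Made-up training rows.
rows = [(14.0, 0.001), (18.0, 0.003), (31.0, -0.010), (40.0, 0.020)]
models = fit_by_regime(rows)
model = models[regime_of(vix=33.0)]
print(model.predict())
```

Unlike a quality filter, this keeps every bar in play: each row trains exactly one model instead of being thrown away.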

4. Free Macro Data to Add

| Data | What It Tells You | Source |
| --- | --- | --- |
| VIX | Fear gauge, inverse SPY | Yahoo Finance (^VIX) |
| 10Y-2Y Yield Spread | Recession predictor | FRED (free) |
| DXY | Dollar strength | Yahoo Finance |
| Sector ETFs (XLF, XLE, XLK) | Rotation signal | Yahoo Finance |
| India VIX | Vol regime for NIFTY | NSE India |

The Lorentzian Classification Lesson

TradingView's popular Lorentzian Classification indicator uses k-NN with Lorentzian distance (not Euclidean). Its 15 features include 10 custom external sources — VIX, volume profile, macro data.
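
The distance itself is simple: sum ln(1 + |difference|) over the features, which dampens the effect of large outliers compared with Euclidean distance. The k-NN wrapper below is a bare-bones sketch with made-up data; the actual indicator layers more logic on top that is not reproduced here.

```python
import math
from collections import Counter

def lorentzian_distance(a, b):
    """Sum over features of ln(1 + |a_i - b_i|)."""
    return sum(math.log(1.0 + abs(x - y)) for x, y in zip(a, b))

def knn_predict(train, query, k=3):
    """Majority label among the k training points nearest to `query`."""
    nearest = sorted(train, key=lambda item: lorentzian_distance(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Made-up (features, label) pairs; labels are +1 for up, -1 for down.
train = [
    ([0.2, 0.1], 1), ([0.3, 0.2], 1), ([0.25, 0.05], 1),
    ([-0.4, -0.3], -1), ([-0.5, -0.2], -1),
]
print(knn_predict(train, [0.22, 0.12]))
```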

Key insight: OHLCV is the canvas. External features are the paint. We were trying to paint a masterpiece with only one color.


Key Takeaways

"Single asset OHLCV = dead end. Cross-asset pairs = real alpha. More features or more bars won't fix a missing signal."

  1. Feature engineering beats model choice: the DRW winner's 8 AutoEncoder features outperformed our 26 indicators
  2. Feature source beats feature count — 29 raw cross-sectional features beat 26 derived single-asset features
  3. Cross-asset data is non-negotiable — SPY-VIX, SPY-QQQ, NIFTY-BankNifty
  4. Walk-forward validation is honest — it shows no edge; that is the point
  5. Intraday noise exceeds daily noise — more bars without more signal = worse results
  6. OHLCV is context, not signal — use it for regime detection, not direction prediction
  7. Simple models work if features are strong — LinearRegression + 29 features = 0.111 LB

Resources and References

  • DRW Crypto Market Prediction — 1st Place Solution (A_A, 2025)
  • DRW Crypto — 11th Place LinearRegression Solution (2025)
  • Walk-Forward Validation Study — Texas Tech (2025)
  • Lorentzian Classification Premium — jdehorty (TradingView)
  • Pairs Trading: Performance of a Relative-Value Arbitrage Rule — Gatev, Goetzmann, Rouwenhorst (Yale)

This article documents real experiments, real failures, and real learnings from building ML trading models. No cherry-picked results. No hindsight bias.

quizforml.com — Learn. Build. Fail. Learn Again.
