2026-03-22

Dollar Bars: Why Value-Based Sampling Beats Time-Based OHLCV

Raw OHLCV data is sampled at fixed time intervals, which conflates calendar noise with market activity. Dollar bars fix this by sampling on traded value.

The problem with time bars

Standard OHLCV data gives you one bar per fixed time interval — one minute, one day. This seems natural, but it has a statistical cost: market activity is not uniform across time.

Pre-market hours are thin. The open and close are chaotic. Lunch is quiet. A one-minute bar at 9:31 contains far more information than one at 13:00. When you feed this data to an ML model, you're training on a signal with heteroskedastic, serially correlated noise baked in by construction.
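To make the unevenness concrete, here is a minimal sketch that sums traded dollar value per one-minute bucket. The trades are hypothetical, made-up values chosen only to illustrate the shape of the problem, and `dollar_value_per_minute` is an illustrative helper, not part of any library.

```python
from collections import defaultdict


def dollar_value_per_minute(trades):
    """Sum price * volume for each one-minute bucket (minutes since midnight)."""
    totals = defaultdict(float)
    for ts_seconds, price, volume in trades:
        minute = ts_seconds // 60
        totals[minute] += price * volume
    return dict(totals)


# Hypothetical trades: (seconds since midnight, price, shares)
trades = [
    (34260, 100.0, 500),  # 09:31:00
    (34275, 100.2, 800),  # 09:31:15
    (34310, 100.1, 700),  # 09:31:50
    (46800, 100.5, 50),   # 13:00:00
    (46845, 100.5, 30),   # 13:00:45
]

per_minute = dollar_value_per_minute(trades)
for minute, value in sorted(per_minute.items()):
    print(f"{minute // 60:02d}:{minute % 60:02d}  ${value:,.0f}")
```

Both buckets become one row in a one-minute OHLCV series, even though the 09:31 bar in this toy data carries many times the traded value of the 13:00 bar.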

What dollar bars do differently

A dollar bar closes when a threshold of traded dollar value is reached — for example, every $10M in notional flow. This means:

Bars are denser during high-activity periods (earnings, macro events)
Bars span more clock time during thin markets (pre-market, or weekends in 24/7 markets such as crypto)
Returns computed over the resulting bars tend to be closer to i.i.d. than time-bar returns (López de Prado, AFML Ch. 2)

This matters for any downstream statistical test, label construction, or ML feature engineering step that assumes observations are roughly independent and identically distributed.

The sampling formula

Dollar-bar sampling keeps a running counter of traded value, updated on every trade:

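counter += price * volume  # dollar value of this trade
if counter >= threshold:
    emit_bar()   # close the current bar at this trade
    counter = 0  # start accumulating the next bar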
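The counter logic above can be fleshed out into a small generator. This is a minimal sketch, not the pipeline's actual implementation: the `Bar` dataclass, the `dollar_bars` name, and the `(price, volume)` tuple format are all assumptions made for illustration. As in the pseudocode, the counter resets to zero, so any overshoot past the threshold is discarded, and a trailing partial bar is dropped.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Tuple


@dataclass
class Bar:
    open: float
    high: float
    low: float
    close: float
    volume: float
    dollar_value: float


def dollar_bars(
    trades: Iterable[Tuple[float, float]], threshold: float
) -> Iterator[Bar]:
    """Yield a bar each time cumulative price * volume reaches `threshold`."""
    bar: Optional[Bar] = None
    for price, volume in trades:
        if bar is None:
            # First trade of a new bar sets the open.
            bar = Bar(price, price, price, price, 0.0, 0.0)
        bar.high = max(bar.high, price)
        bar.low = min(bar.low, price)
        bar.close = price
        bar.volume += volume
        bar.dollar_value += price * volume
        if bar.dollar_value >= threshold:
            yield bar
            bar = None  # reset the counter; overshoot is not carried over


# Hypothetical trades as (price, shares)
trades = [(100.0, 50), (101.0, 30), (99.0, 40), (100.0, 100), (102.0, 10)]
bars = list(dollar_bars(trades, threshold=8_000))
for b in bars:
    print(b)
```

Carrying the overshoot into the next bar, or splitting a large trade across bars, are alternative design choices; the reset-to-zero version is shown only because it mirrors the pseudocode.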

The threshold is typically set to hit a target average number of bars per day, e.g. threshold = daily_avg_dollar_volume / N_bars_per_day. With N_bars_per_day = 1 you get roughly as many bars as a daily OHLCV series; larger values give intraday resolution.
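As a worked example of the threshold formula, the sketch below averages a few days of dollar volume and divides by a target bar count. The function name `calibrate_threshold` and the input figures are hypothetical; the numbers are chosen so the result matches the $10M example used earlier.

```python
from statistics import mean


def calibrate_threshold(daily_dollar_volumes, bars_per_day):
    """threshold = average daily dollar volume / target bars per day."""
    if bars_per_day <= 0:
        raise ValueError("bars_per_day must be positive")
    return mean(daily_dollar_volumes) / bars_per_day


# Hypothetical daily dollar volumes for one ticker
daily_dollar_volumes = [480e6, 520e6, 500e6]
threshold = calibrate_threshold(daily_dollar_volumes, bars_per_day=50)
print(f"${threshold:,.0f} per bar")
```

In practice the average would probably be taken over a longer, possibly rolling, window so the threshold tracks changes in an instrument's liquidity; that choice is left open here.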

Why this is step one in the AFML pipeline

The canonical pipeline from López de Prado is:

Dollar bars — activity-adjusted sampling with more stable statistical properties
CUSUM filter — event sampling on cumulative deviations in the series
Triple-barrier labeling — price-path labels
Meta-labeling — secondary classifier gate
Fractional differentiation — memory-preserving stationarity
ML model — trained on clean, labeled features

Skipping dollar bars means your CUSUM thresholds are calibrated on a noisy, time-heterogeneous series. Your labels will be polluted by calendar effects. Your ML model will partially learn "what time of day is it?" instead of "is there a tradeable signal here?"
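To show where bar quality enters the next step, here is a minimal sketch in the spirit of the symmetric CUSUM filter from AFML Ch. 2, applied to bar closes. Running it on log returns, the function name `cusum_events`, and the sample closes and threshold are assumptions made for illustration, not the pipeline's actual code.

```python
import math


def cusum_events(closes, h):
    """Return bar indices where the cumulative up or down move exceeds h."""
    events = []
    s_pos = s_neg = 0.0
    for i in range(1, len(closes)):
        r = math.log(closes[i] / closes[i - 1])  # log return of bar i
        s_pos = max(0.0, s_pos + r)  # cumulative upward drift
        s_neg = min(0.0, s_neg + r)  # cumulative downward drift
        if s_neg < -h:
            s_neg = 0.0
            events.append(i)
        elif s_pos > h:
            s_pos = 0.0
            events.append(i)
    return events


# Hypothetical closes of consecutive dollar bars
closes = [100.0, 100.5, 101.0, 102.5, 102.0, 100.0, 99.0]
events = cusum_events(closes, h=0.02)
print(events)
```

Because `h` is a single fixed number, it only means the same thing from bar to bar if the bars themselves carry comparable amounts of activity, which is the property dollar bars are meant to provide.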

Implementation note

In the finmlresearch.com pipeline, dollar bars are constructed in the ohlcv_pipeline.py fetch step and stored as Parquet files per ticker. DuckDB then provides a lazy glob scan across the entire universe for any downstream analysis.

This post is part of the AFML implementation series — translating López de Prado's methodology into production Python.