This project implements a conditional Denoising Diffusion Implicit Model (DDIM) for the complex task of financial time series forecasting. Traditional forecasting models often struggle to capture the stochastic and non-linear nature of financial markets. This project moves beyond standard approaches by leveraging a generative model that can produce a distribution of plausible future price sequences, conditioned on specific contextual information.
The core innovation lies in its conditional architecture. The model's predictions are not generated in a vacuum; they are guided by:

- Historical price data, which provides the recent market context.
- A ticker ID, which identifies the specific asset being forecast and allows for asset-specific behavior.
The DDIM is a type of generative model that learns to create data by reversing a gradual noising process. This is broken into two key phases.
In the forward process, we take a clean data sample (a future sequence of prices) and systematically add Gaussian noise over a series of discrete time steps. By the end of this process, the original structured data is transformed into pure, random noise. The ForwardProcess class manages this, pre-calculating a noise schedule (betas, alphas) to control the noise level at each step. The model learns the underlying data distribution by being trained to reverse this very process.
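A useful property of this formulation is that the noisy sample at any step t can be drawn in closed form from the clean sample, rather than by iterating through every step. Below is a minimal sketch of that operation, assuming a linear beta schedule; the actual ForwardProcess class and its schedule are not reproduced here, so treat the names and values as illustrative.

```python
import torch

# Sketch of the closed-form forward (noising) step. The linear beta schedule
# and the names below are assumptions for illustration only.
num_steps = 1000
betas = torch.linspace(1e-4, 0.02, num_steps)   # noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)       # cumulative products, one per step

def add_noise(x0: torch.Tensor, t: torch.Tensor):
    """Sample x_t ~ q(x_t | x_0) directly, without iterating over steps."""
    noise = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))   # broadcast over the batch
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return x_t, noise   # the sampled noise is the training target
```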
The reverse process is where prediction occurs: we start with random noise and iteratively denoise it to generate a clean, structured data sample. The DDIM sampler is "implicit" and deterministic, following a direct, non-stochastic path from noise to a clean signal, and it can traverse the noise schedule in far fewer steps, making sampling much faster than with traditional Denoising Diffusion Probabilistic Models (DDPMs).
The ddim_reverse_step function orchestrates this. At each step, the trained model predicts the noise present in the current sample, given the context (historical data, ticker ID). The DDIM formula then uses this noise prediction to calculate the slightly-less-noisy sample for the previous step, progressively refining the output until a clean prediction is formed.
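A deterministic (eta = 0) DDIM update has roughly the following shape. This is a sketch only; the project's actual ddim_reverse_step will differ in its interface, and here `context` simply stands for the bundled conditioning information (historical window and ticker ID).

```python
import torch

def ddim_step(model, x_t, t, t_prev, context, alpha_bars):
    """Illustrative deterministic (eta = 0) DDIM update for integer step indices t > t_prev."""
    eps_pred = model(x_t, t, context)                             # model predicts the added noise
    a_t, a_prev = alpha_bars[t], alpha_bars[t_prev]
    x0_pred = (x_t - (1 - a_t).sqrt() * eps_pred) / a_t.sqrt()    # implied clean sample
    # Re-noise the clean estimate down to the previous, less noisy step.
    return a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps_pred
```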
The heart of the denoising process is a Transformer-based model (TransformerModel). Its sole purpose is not to predict the future directly, but to predict the noise that was added to a corrupted future sample.
The model takes as input:

- The noisy future price sequence being denoised.
- The historical price context.
- The ticker ID of the asset.
- The current time step t of the diffusion process.

The model operates on a simplified, multi-feature representation of the market at each time step.
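A minimal sketch of such a conditional noise predictor is shown below. The layer sizes, the number of tickers, and the way conditioning is injected are all assumptions for illustration, not the project's actual TransformerModel.

```python
import torch
import torch.nn as nn

class NoisePredictor(nn.Module):
    """Illustrative conditional noise predictor; hyperparameters are assumptions."""
    def __init__(self, n_features: int, d_model: int = 64, n_tickers: int = 100):
        super().__init__()
        self.in_proj = nn.Linear(n_features, d_model)
        self.ticker_emb = nn.Embedding(n_tickers, d_model)   # asset-specific embedding
        self.t_emb = nn.Embedding(1000, d_model)             # diffusion-step embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.out_proj = nn.Linear(d_model, n_features)

    def forward(self, noisy_future, history, ticker_id, t):
        # Concatenate history and the noisy future along the time axis, then add
        # the ticker and diffusion-step embeddings to every position.
        x = self.in_proj(torch.cat([history, noisy_future], dim=1))
        cond = self.ticker_emb(ticker_id) + self.t_emb(t)    # (batch, d_model)
        h = self.encoder(x + cond.unsqueeze(1))
        # Only the positions covering the future window carry the noise estimate.
        return self.out_proj(h[:, -noisy_future.size(1):, :])
```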
To manage potentially terabytes of financial data, the project uses a pre-processing step (prepare_dataset.py) to convert raw CSV files into a single, compressed HDF5 file. The HDF5Dataset class then allows the PyTorch DataLoader to read batches directly from disk, keeping memory usage minimal and enabling fast, parallel data loading.
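A minimal sketch of this disk-backed pattern is shown below, assuming dataset keys such as "history", "future", and "ticker_id"; the real layout produced by prepare_dataset.py may differ.

```python
import h5py
import torch
from torch.utils.data import Dataset

class HDF5Dataset(Dataset):
    """Sketch of a disk-backed dataset; key names inside the HDF5 file are assumptions."""
    def __init__(self, path: str):
        self.path = path
        with h5py.File(path, "r") as f:
            self.length = f["history"].shape[0]
        self._file = None                      # opened lazily, once per DataLoader worker

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self._file is None:
            self._file = h5py.File(self.path, "r")   # safe with num_workers > 0
        history = torch.from_numpy(self._file["history"][idx]).float()
        future = torch.from_numpy(self._file["future"][idx]).float()
        ticker = int(self._file["ticker_id"][idx])
        return history, future, ticker
```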
The training script (train.py) orchestrates the learning process. In each step, it takes a batch of data, adds a random amount of noise to the target sequence, and tasks the Transformer model with predicting that noise. The loss is calculated as the Mean Squared Error between the true noise and the predicted noise, with an additional smoothness penalty to encourage more realistic price trajectories.
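One such training step might look like the sketch below, which reuses the `add_noise` helper from the forward-process sketch above. The exact form of the smoothness penalty in train.py is an assumption; here it simply penalizes large step-to-step changes in the prediction.

```python
import torch
import torch.nn.functional as F

def training_step(model, batch, alpha_bars, smooth_weight=0.1):
    """Illustrative loss computation: noise-prediction MSE plus a smoothness term."""
    history, future, ticker = batch
    # Pick a random diffusion step for each sample in the batch.
    t = torch.randint(0, len(alpha_bars), (future.size(0),))
    x_t, noise = add_noise(future, t)            # forward process (see earlier sketch)
    eps_pred = model(x_t, history, ticker, t)    # Transformer predicts the added noise
    mse = F.mse_loss(eps_pred, noise)
    # Penalize abrupt step-to-step changes to encourage smoother implied trajectories.
    smooth = (eps_pred[:, 1:] - eps_pred[:, :-1]).pow(2).mean()
    return mse + smooth_weight * smooth
```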
The prediction script (predict.py) uses the trained model to generate new price sequences. It takes a historical context, feeds it to the reverse diffusion process, and generates multiple sample predictions to form a probabilistic forecast.
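A sketch of such a sampling loop is shown below, reusing the `ddim_step` helper from the earlier sketch; the 50-step schedule, argument names, and tensor shapes are assumptions.

```python
import torch

@torch.no_grad()
def sample_paths(model, history, ticker_id, alpha_bars,
                 n_samples=100, future_len=30, n_features=3):
    """Illustrative probabilistic forecast: run the deterministic DDIM reverse
    process from several independent noise seeds."""
    history = history.expand(n_samples, -1, -1)   # repeat the (1, L, F) context per sample
    ticker_id = ticker_id.expand(n_samples)       # repeat the (1,) ticker ID per sample
    x = torch.randn(n_samples, future_len, n_features)   # start every path from pure noise

    # Adapt the Transformer's interface to the generic ddim_step sketch above.
    denoiser = lambda x_t, t, ctx: model(x_t, ctx[0], ctx[1],
                                         torch.full((x_t.size(0),), t, dtype=torch.long))

    steps = torch.linspace(len(alpha_bars) - 1, 0, steps=50).long().tolist()
    for t, t_prev in zip(steps[:-1], steps[1:]):
        x = ddim_step(denoiser, x, t, t_prev, (history, ticker_id), alpha_bars)
    return x                                       # (n_samples, future_len, n_features)
```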
The model's performance is assessed in evaluate_model.py using metrics suited for probabilistic forecasts rather than single point estimates.
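The exact metric list is not reproduced here; as one common example of this class of metric, a sample-based estimate of the Continuous Ranked Probability Score (CRPS) can be computed as follows (a sketch, not necessarily what evaluate_model.py uses).

```python
import torch

def crps_from_samples(samples: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Sample-based CRPS estimate for a probabilistic forecast.
    samples: (n_samples, T) forecast paths; target: (T,) realized path."""
    term1 = (samples - target).abs().mean()
    term2 = 0.5 * (samples.unsqueeze(0) - samples.unsqueeze(1)).abs().mean()
    return term1 - term2   # lower is better
```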
Figure 1: Sample prediction for the SPY ticker. The model takes the last 30 minutes of price data (blue line) as context and generates a distribution of 100 possible future price paths (light gray lines) for the next 30 minutes.
Key observations from this example include:
This project successfully demonstrates the application of a conditional Denoising Diffusion Implicit Model to the challenging domain of financial forecasting. By combining a powerful Transformer architecture with a principled, generative framework, the model captures the complex dynamics of financial time series.
The key strengths of this approach are its probabilistic nature, providing a distribution of outcomes rather than a single point forecast, and its conditional design, which allows for more nuanced and asset-specific predictions. The use of an efficient HDF5 data pipeline makes the system scalable and robust for large-scale financial modeling.