Oleg Sheynin c2f701e3a2 progress

2025-07-25 20:20:23 +00:00

Pairs Trading Backtest

This document provides a guide to understanding, configuring, and running the pairs trading backtest system.

Overview

The system is designed to backtest pairs trading strategies on historical market data. It allows users to select different strategies, configure parameters, and analyze the performance of these strategies.

Core Concepts

Trading Pair

A trading pair consists of two financial instruments (e.g., stocks or cryptocurrencies) whose prices are believed to have a long-term statistical relationship (cointegration). The strategy aims to profit from temporary deviations from this relationship.

Strategy

The system supports different strategies for identifying and exploiting trading opportunities. Each strategy has its own set of configurable parameters.

Trading Signals

Trading signals indicate when to open or close a position based on the configured strategy and parameters. These signals are typically generated when the "dis-equilibrium" (the deviation from the long-term relationship) crosses certain thresholds.

Running a Backtest

1. Configuration

The primary configuration for the backtest is managed in the src/pt_backtest.py file. Here, you will define which dataset to use (cryptocurrencies or equities) and which strategy to employ.

Choosing a Dataset:

You can switch between CRYPTO_CONFIG and EQT_CONFIG by uncommenting the desired assignment and commenting out the other:

# CONFIG = CRYPTO_CONFIG  # For cryptocurrency data
CONFIG = EQT_CONFIG      # For equity data

Each configuration dictionary specifies:

data_directory: Path to the data files.
datafiles: A list of database files to process. You can comment/uncomment specific files to include/exclude them from the backtest.
db_table_name: The name of the table within the SQLite database.
instruments: A list of symbols to consider for forming trading pairs.
trading_hours: Defines the session start and end times, crucial for equity markets.
stat_model_price: The column in the data to be used as the price (e.g., "close").
dis-equilibrium_open_trshld: The threshold (in standard deviations) of the dis-equilibrium for opening a trade.
dis-equilibrium_close_trshld: The threshold (in standard deviations) of the dis-equilibrium for closing an open trade.
training_minutes: The length of the rolling window (in minutes) used to train the model (e.g., calculate cointegration, mean, and standard deviation of the dis-equilibrium).
funding_per_pair: The amount of capital allocated to each trading pair.
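As a rough illustration, a configuration dictionary with these keys might look like the sketch below. Only the key names come from the list above. Every value (paths, file names, table name, symbols, the shape of trading_hours, and the numeric settings) is a placeholder assumption, not the project's actual configuration.

```python
# Hypothetical example: all values are assumed placeholders.
EXAMPLE_CONFIG = {
    "data_directory": "./data/equity",        # assumed path
    "datafiles": [
        "day_one.db",                          # assumed file name
        # "day_two.db",                        # commented out, so excluded from the run
    ],
    "db_table_name": "bars_1min",              # assumed table name
    "instruments": ["AAA", "BBB", "CCC"],      # assumed symbols
    # The structure of trading_hours is an assumption; the real format may differ.
    "trading_hours": {"begin": "09:30:00", "end": "16:00:00"},
    "stat_model_price": "close",
    "dis-equilibrium_open_trshld": 2.0,        # open at 2 standard deviations
    "dis-equilibrium_close_trshld": 0.5,       # close at 0.5 standard deviations
    "training_minutes": 120,
    "funding_per_pair": 2000.0,
}
```

Because the threshold keys contain a hyphen, they can only be read with subscript syntax, for example EXAMPLE_CONFIG["dis-equilibrium_open_trshld"].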

Choosing a Strategy:

The system currently offers two main strategies: StaticFitStrategy and SlidingFitStrategy. You select a strategy by instantiating it:

# STRATEGY = StaticFitStrategy()
STRATEGY = SlidingFitStrategy()

StaticFitStrategy: This strategy fits the cointegration model once at the beginning of each trading day (or once for the whole dataset when a single file is processed without any rolling-window logic). The parameters derived from this initial fit (the mean and standard deviation of the dis-equilibrium) are then used to generate trading signals for the rest of the day.
- Pros: Simpler, computationally less intensive.
- Cons: May not adapt well to changing market conditions during the day.
SlidingFitStrategy: This strategy uses a rolling window approach. The cointegration model and its parameters are re-estimated at regular intervals over a look-back window whose length is training_minutes (how often the refit happens depends on how the strategy implements the sliding window). This allows the strategy to adapt to evolving market dynamics.
- Pros: More adaptive to changing market conditions.
- Cons: Computationally more intensive. The training_minutes parameter is crucial here as it defines the look-back period for each re-estimation.
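The difference between the two strategies can be pictured as a difference in how rows are split into training and trading windows. The sketch below works on row indices only and is not the project's implementation; the function names and the step parameter are hypothetical.

```python
def static_fit_windows(n_rows, training_rows):
    """One fit: train on the first `training_rows` rows, trade on the rest."""
    return [(range(0, training_rows), range(training_rows, n_rows))]


def sliding_fit_windows(n_rows, training_rows, step=1):
    """Refit every `step` rows, always on the most recent `training_rows` rows."""
    windows = []
    for start in range(0, n_rows - training_rows, step):
        train = range(start, start + training_rows)
        # Trade only on the rows that follow the training window.
        test_end = min(start + training_rows + step, n_rows)
        test = range(start + training_rows, test_end)
        windows.append((train, test))
    return windows


# With 10 one-minute bars and a 4-bar training window:
static = static_fit_windows(10, 4)           # 1 fit, trades rows 4..9
sliding = sliding_fit_windows(10, 4, step=2) # 3 fits, each trades 2 rows
```

The static variant fits once and reuses the result, while the sliding variant pays for one fit per step, which is where the extra computational cost comes from.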

2. Parameters for Trading Signals

The key parameters that determine trading signals are primarily found within the CONFIG dictionaries:

dis-equilibrium_open_trshld: This is the number of standard deviations the current dis-equilibrium must move away from its mean (calculated during the training period) to trigger an opening signal.
- A higher value means the strategy will wait for a more significant deviation before entering a trade, leading to fewer but potentially more robust signals.
- A lower value means the strategy will enter trades on smaller deviations, leading to more frequent signals but potentially more false positives.
dis-equilibrium_close_trshld: This is the level, in standard deviations from the mean, to which the current dis-equilibrium must revert to trigger a closing signal.
- A higher value (closer to the dis-equilibrium_open_trshld) means the strategy will close trades more quickly as the dis-equilibrium starts to revert.
- A lower value (closer to zero) means the strategy will hold onto trades longer, waiting for the dis-equilibrium to revert more significantly towards the mean.
training_minutes:
- For StaticFitStrategy, this determines the initial period of data used to establish the cointegration relationship and calculate the baseline dis-equilibrium statistics for the entire trading day (or dataset portion being processed).
- For SlidingFitStrategy, this defines the length of the rolling window. The model is refit using data from the most recent training_minutes period. A shorter window makes the strategy more responsive to recent price action but might be more prone to noise. A longer window provides a more stable model but might be slower to adapt to new trends.
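To make the interplay of the two thresholds concrete, here is a minimal sketch of threshold-based signal generation on an already scaled dis-equilibrium series. It assumes a single position at a time and symmetric thresholds. The function name and signal labels are hypothetical and the project's actual signal logic may differ.

```python
def generate_signals(z_scores, open_threshold, close_threshold):
    """Return one signal per scaled dis-equilibrium value."""
    position_open = False
    signals = []
    for z in z_scores:
        if not position_open and abs(z) >= open_threshold:
            # Spread is stretched: bet on reversion towards the mean.
            signals.append("SELL_SPREAD" if z > 0 else "BUY_SPREAD")
            position_open = True
        elif position_open and abs(z) <= close_threshold:
            # Spread has reverted far enough: exit.
            signals.append("CLOSE")
            position_open = False
        else:
            signals.append("HOLD")
    return signals


z_scores = [0.3, 1.1, 2.4, 1.8, 0.9, 0.4, -0.2]
signals = generate_signals(z_scores, open_threshold=2.0, close_threshold=0.5)
# signals == ["HOLD", "HOLD", "SELL_SPREAD", "HOLD", "HOLD", "CLOSE", "HOLD"]
```

Raising close_threshold to 1.0 on the same series moves the CLOSE one step earlier (at 0.9 instead of 0.4), which matches the behaviour described above.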

3. Running the Script

Once the configuration is set, you can run the backtest from your terminal:

python src/pt_backtest.py

The script will process each datafile specified in the CONFIG, create all possible unique pairs from the instruments list, and apply the chosen strategy.
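Forming all unique pairs from a list of symbols can be sketched with the standard library. The symbols below are placeholders, and the script may build its pairs differently.

```python
from itertools import combinations

# Placeholder symbols; the real list comes from CONFIG["instruments"].
instruments = ["AAA", "BBB", "CCC", "DDD"]

# Each unordered pair appears exactly once: n * (n - 1) / 2 pairs in total.
pairs = list(combinations(instruments, 2))
```

The number of pairs grows quadratically with the number of instruments (4 symbols give 6 pairs, 20 symbols give 190), so the instruments list has a direct effect on run time.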

4. Interpreting Results

The script will output:

Progress messages for each datafile being processed.
A summary of trades taken.
Grand totals of performance metrics (PnL, etc.).
A list of any outstanding positions at the end of the backtest.

The core logic for a pair involves:

Data Preparation: For each pair, relevant price series are extracted.
Training Phase (for SlidingFitStrategy, this happens repeatedly; for StaticFitStrategy, typically once per day/file):
- The get_datasets() method in TradingPair splits data into training and testing sets.
- check_cointegration() uses the Johansen test to see if the pair's price series are cointegrated within the current training window. If not, the pair is often skipped for that window.
- If cointegrated, fit_VECM() estimates a Vector Error Correction Model (VECM). The beta coefficients from this model define the cointegrating relationship (the "spread" or "dis-equilibrium series").
- training_mu_ (mean) and training_std_ (standard deviation) of this dis-equilibrium series are calculated. These are crucial for scaling the dis-equilibrium and setting trade thresholds.
Prediction/Trading Phase:
- The strategy iterates through the "testing" data points.
- For each point, the current dis-equilibrium is calculated using the beta from the VECM.
- This dis-equilibrium is then scaled: (current_disequilibrium - training_mu_) / training_std_.
- This scaled value is compared against dis-equilibrium_open_trshld and dis-equilibrium_close_trshld to generate buy/sell/close signals.
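The scaling step can be illustrated with a small standard-library sketch. It assumes that the cointegrating vector beta has already been estimated, that the dis-equilibrium is the beta-weighted sum of the two prices, and that the standard deviation is the sample standard deviation. The prices and beta are made-up numbers and the helper names are hypothetical.

```python
from statistics import mean, stdev


def disequilibrium(price_a, price_b, beta):
    """Beta-weighted combination of the two prices (assumed form of the spread)."""
    return beta[0] * price_a + beta[1] * price_b


beta = (1.0, -2.0)                       # assumed, as if taken from a fitted VECM
train_a = [10.0, 11.0, 12.0, 13.0]       # made-up training prices
train_b = [5.0, 5.0, 6.0, 6.0]

training_series = [disequilibrium(a, b, beta) for a, b in zip(train_a, train_b)]
training_mu = mean(training_series)      # plays the role of training_mu_
training_std = stdev(training_series)    # plays the role of training_std_


def scaled_disequilibrium(price_a, price_b):
    """(current_disequilibrium - training_mu_) / training_std_"""
    return (disequilibrium(price_a, price_b, beta) - training_mu) / training_std


z = scaled_disequilibrium(14.0, 5.0)     # a new "testing" data point
```

Only training data enters training_mu and training_std, so the scaled value at a testing point depends on nothing from the future.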

Customizing and Extending

Adding New Strategies: Create a new class that inherits from a base strategy class (if one exists) or implements a similar interface to StaticFitStrategy or SlidingFitStrategy. The core method to implement would be run_pair().
Modifying Data Loading: The tools/data_loader.py module can be modified to support different data formats or sources.
Changing Cointegration/Model Parameters: The TradingPair class houses the VECM fitting and cointegration checks. You can adjust parameters like k_ar_diff in coint_johansen or the VECM model itself.

Important Considerations

Data Quality: Ensure your market data is clean, accurate, and properly formatted. Gaps or errors in data can significantly impact backtest results.
Transaction Costs: The current backtest might not explicitly model transaction costs (brokerage fees, slippage). These can have a significant impact on the profitability of high-frequency strategies. Consider adding a cost model to BacktestResult or within the strategy execution.
Look-ahead Bias: Be extremely careful to avoid look-ahead bias. Ensure that decisions at any point in time are made using only information that would have been available at that time. The use of training_df_ and testing_df_ in TradingPair is designed to help prevent this.
Overfitting: When optimizing parameters (dis-equilibrium_open_trshld, training_minutes, etc.), be mindful of overfitting to the historical data. A strategy that performs exceptionally well on past data may not perform well in the future. Use out-of-sample testing or walk-forward optimization for more robust validation.
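For the transaction-cost consideration above, one simple starting point is a proportional cost charged on traded notional. The sketch below is an assumption about how such a model could look. The function name and the fee and slippage figures are hypothetical and are not part of BacktestResult.

```python
def net_pnl(gross_pnl, traded_notional, fee_bps=1.0, slippage_bps=2.0):
    """Subtract proportional costs (in basis points of notional) from gross PnL."""
    cost = traded_notional * (fee_bps + slippage_bps) / 10_000
    return gross_pnl - cost


# A round trip on a pair funded with 2000: the notional is traded once
# to open and once to close, so 4000 in total.
result = net_pnl(gross_pnl=15.0, traded_notional=4000.0)
```

With these assumed rates the round trip costs 1.2, so a gross PnL of 15.0 becomes roughly 13.8. For strategies that trade often, this deduction applies to every round trip and can erase small average gains.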

This guide should provide a solid foundation for working with the pairs trading backtest system. Experiment with different configurations and strategies to find what works best for your chosen markets and instruments.
