pairs_trading/README.md
Oleg Sheynin c2f701e3a2 progress
2025-07-25 20:20:23 +00:00

# Pairs Trading Backtest
This document provides a guide to understanding, configuring, and running the pairs trading backtest system.
## Overview
The system is designed to backtest pairs trading strategies on historical market data.
It allows users to select different strategies, configure parameters, and analyze the
performance of these strategies.
## Core Concepts
### Trading Pair
A trading pair consists of two financial instruments (e.g., stocks or cryptocurrencies)
whose prices are believed to have a long-term statistical relationship (cointegration).
The strategy aims to profit from temporary deviations from this relationship.
### Strategy
The system supports different strategies for identifying and exploiting trading opportunities. Each strategy has its own set of configurable parameters.
### Trading Signals
Trading signals indicate when to open or close a position based on the configured strategy
and parameters. These signals are typically generated when the "dis-equilibrium" (the
deviation from the long-term relationship) crosses certain thresholds.
## Running a Backtest
### 1. Configuration
The primary configuration for the backtest is managed in the `src/pt_backtest.py` file. Here, you will define which dataset to use (cryptocurrencies or equities) and which strategy to employ.
#### Choosing a Dataset:
You can switch between `CRYPTO_CONFIG` and `EQT_CONFIG` by uncommenting the desired assignment and commenting out the other:
```python
# CONFIG = CRYPTO_CONFIG # For cryptocurrency data
CONFIG = EQT_CONFIG # For equity data
```
Each configuration dictionary specifies:
- `data_directory`: Path to the data files.
- `datafiles`: A list of database files to process. You can comment/uncomment specific files to include/exclude them from the backtest.
- `db_table_name`: The name of the table within the SQLite database.
- `instruments`: A list of symbols to consider for forming trading pairs.
- `trading_hours`: Defines the session start and end times, crucial for equity markets.
- `stat_model_price`: The column in the data to be used as the price (e.g., "close").
- `dis-equilibrium_open_trshld`: The threshold (in standard deviations) of the dis-equilibrium for opening a trade.
- `dis-equilibrium_close_trshld`: The threshold (in standard deviations) of the dis-equilibrium for closing an open trade.
- `training_minutes`: The length of the rolling window (in minutes) used to train the model (e.g., calculate cointegration, mean, and standard deviation of the dis-equilibrium).
- `funding_per_pair`: The amount of capital allocated to each trading pair.
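
For orientation, a configuration dictionary with these keys might look like the sketch below. The key names follow the list above; every value (paths, file names, table name, symbols, session times, thresholds) and the structure of `trading_hours` are placeholders assumed for illustration, not values taken from `src/pt_backtest.py`.

```python
# Hypothetical example: key names follow the list above, all values are placeholders.
EXAMPLE_CONFIG = {
    "data_directory": "./data/equity",        # assumed path
    "datafiles": [
        "20250101.mktdata.db",                # assumed file name
        # "20250102.mktdata.db",              # commented out: excluded from the run
    ],
    "db_table_name": "md_1min_bars",          # assumed table name
    "instruments": ["AAA", "BBB", "CCC"],     # placeholder symbols
    "trading_hours": {                        # assumed structure
        "begin_session": "09:30:00",
        "end_session": "16:00:00",
    },
    "stat_model_price": "close",
    "dis-equilibrium_open_trshld": 2.0,       # open at 2 standard deviations
    "dis-equilibrium_close_trshld": 0.5,      # close within 0.5 standard deviations
    "training_minutes": 120,
    "funding_per_pair": 2000.0,
}
```

Keeping the close threshold below the open threshold ensures a position is not closed at the same level at which it was opened.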
#### Choosing a Strategy:
The system currently offers two main strategies: `StaticFitStrategy` and `SlidingFitStrategy`. You select a strategy by instantiating it:
```python
# STRATEGY = StaticFitStrategy()
STRATEGY = SlidingFitStrategy()
```
- **`StaticFitStrategy`**: This strategy fits the cointegration model once at the beginning
  of each trading day (or once for the entire dataset when run on a single file, because
  the strategy itself applies no rolling-window logic). The parameters derived from this
  initial fit (mean and standard deviation of the dis-equilibrium) are used to generate
  trading signals throughout the day.
  - **Pros**: Simpler, computationally less intensive.
  - **Cons**: May not adapt well to changing market conditions during the day.
- **`SlidingFitStrategy`**: This strategy uses a rolling window approach. The cointegration model and its parameters are re-estimated at regular intervals (defined by `training_minutes` and how the strategy implements the sliding window). This allows the strategy to adapt to evolving market dynamics.
  - **Pros**: More adaptive to changing market conditions.
  - **Cons**: Computationally more intensive. The `training_minutes` parameter is crucial here as it defines the look-back period for each re-estimation.
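
The difference between the two approaches comes down to which rows feed each fit. The sketch below illustrates that idea only; it is not the code of either strategy class. It assumes one row per minute, and the function names and the `step` parameter are hypothetical.

```python
def static_fit_window(n_rows: int, training_minutes: int) -> tuple[int, int]:
    """Training rows for a single fit: the first `training_minutes` rows."""
    return 0, min(training_minutes, n_rows)


def sliding_fit_windows(n_rows: int, training_minutes: int, step: int = 1):
    """Yield (train_start, train_end, predict_row) for each refit.

    Each fit uses the `training_minutes` rows that end just before the row
    being predicted, so no future data enters the training window.
    """
    for predict_row in range(training_minutes, n_rows, step):
        yield predict_row - training_minutes, predict_row, predict_row


if __name__ == "__main__":
    # One fit on the first 120 rows of a 390-row session.
    print(static_fit_window(n_rows=390, training_minutes=120))
    # One refit per predicted row; windows are half-open [start, end).
    print(list(sliding_fit_windows(n_rows=8, training_minutes=5)))
```

With a static fit the model is estimated once, while the sliding variant performs one estimation per step, which is where the extra computational cost comes from.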
### 2. Parameters for Trading Signals
The key parameters that determine trading signals are primarily found within the `CONFIG` dictionaries:
- **`dis-equilibrium_open_trshld`**: The number of standard deviations the current dis-equilibrium must move away from its mean (calculated during the training period) to trigger an opening signal.
  - A *higher* value means the strategy will wait for a more significant deviation before entering a trade, leading to fewer but potentially more robust signals.
  - A *lower* value means the strategy will enter trades on smaller deviations, leading to more frequent signals but potentially more false positives.
- **`dis-equilibrium_close_trshld`**: The level, in standard deviations from the mean, to which the dis-equilibrium must revert to trigger a closing signal.
  - A *higher* value (closer to the `dis-equilibrium_open_trshld`) means the strategy will close trades more quickly as the dis-equilibrium starts to revert.
  - A *lower* value (closer to zero) means the strategy will hold onto trades longer, waiting for the dis-equilibrium to revert more significantly towards the mean.
- **`training_minutes`**:
  - For `StaticFitStrategy`, this determines the initial period of data used to establish the cointegration relationship and calculate the baseline dis-equilibrium statistics for the entire trading day (or dataset portion being processed).
  - For `SlidingFitStrategy`, this defines the length of the rolling window. The model is refit using data from the most recent `training_minutes` period. A shorter window makes the strategy more responsive to recent price action but might be more prone to noise. A longer window provides a more stable model but might be slower to adapt to new trends.
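
The interaction of the two thresholds can be sketched as a small state machine. This is a simplified illustration, not the strategies' actual signal code: it assumes symmetric thresholds applied to the absolute value of the scaled dis-equilibrium, and the function name and action labels are hypothetical.

```python
def generate_signals(scaled_diseq, open_trshld, close_trshld):
    """Return one action per observation: 'OPEN', 'CLOSE', or 'HOLD'."""
    actions = []
    in_position = False
    for z in scaled_diseq:
        if not in_position and abs(z) >= open_trshld:
            actions.append("OPEN")    # deviation is large enough to enter
            in_position = True
        elif in_position and abs(z) <= close_trshld:
            actions.append("CLOSE")   # deviation has reverted far enough to exit
            in_position = False
        else:
            actions.append("HOLD")
    return actions


z_scores = [0.3, 1.2, 2.1, 1.6, 0.9, 0.4, -0.2]
# Lower close threshold: the position opened at 2.1 is held until 0.4.
print(generate_signals(z_scores, open_trshld=2.0, close_trshld=0.5))
# Higher close threshold: the same position is closed earlier, at 0.9.
print(generate_signals(z_scores, open_trshld=2.0, close_trshld=1.0))
```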
### 3. Running the Script
Once the configuration is set, you can run the backtest from your terminal:
```bash
python src/pt_backtest.py
```
The script will process each datafile specified in the `CONFIG`, create all possible unique pairs from the `instruments` list, and apply the chosen strategy.
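
How the script builds the pairs is not shown here, but "all possible unique pairs" corresponds to unordered combinations of two instruments, which the standard library can enumerate. The symbols below are placeholders.

```python
from itertools import combinations

instruments = ["AAA", "BBB", "CCC", "DDD"]  # placeholder symbols

# Unordered pairs: ("AAA", "BBB") is included, ("BBB", "AAA") is not.
pairs = list(combinations(instruments, 2))

print(len(pairs))  # n * (n - 1) / 2 pairs for n instruments
print(pairs[:3])
```

Because the pair count grows quadratically with the number of instruments, a long `instruments` list increases run time considerably.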
### 4. Interpreting Results
The script will output:
- Progress messages for each datafile being processed.
- A summary of trades taken.
- Grand totals of performance metrics (PnL, etc.).
- A list of any outstanding positions at the end of the backtest.

The core logic for a pair involves:
1. **Data Preparation**: For each pair, relevant price series are extracted.
2. **Training Phase** (for `SlidingFitStrategy`, this happens repeatedly; for `StaticFitStrategy`, typically once per day/file):
   * The `get_datasets()` method in `TradingPair` splits data into training and testing sets.
   * `check_cointegration()` uses the Johansen test to see if the pair's price series are cointegrated within the current training window. If not, the pair is often skipped for that window.
   * If cointegrated, `fit_VECM()` estimates a Vector Error Correction Model (VECM). The `beta` coefficients from this model define the cointegrating relationship (the "spread" or "dis-equilibrium series").
   * `training_mu_` (mean) and `training_std_` (standard deviation) of this dis-equilibrium series are calculated. These are crucial for scaling the dis-equilibrium and setting trade thresholds.
3. **Prediction/Trading Phase**:
   * The strategy iterates through the "testing" data points.
   * For each point, the current dis-equilibrium is calculated using the `beta` from the VECM.
   * This dis-equilibrium is then scaled: `(current_disequilibrium - training_mu_) / training_std_`.
   * This scaled value is compared against `dis-equilibrium_open_trshld` and `dis-equilibrium_close_trshld` to generate buy/sell/close signals.
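
The scaling step can be worked through with a few numbers. The sketch below uses made-up prices and a hypothetical cointegrating vector; the helper `diseq()` is illustrative, and the use of the sample standard deviation is an assumption about how `training_std_` is computed.

```python
from statistics import mean, stdev


def diseq(price_a, price_b, beta):
    """Dis-equilibrium of one observation: beta[0] * A + beta[1] * B."""
    return beta[0] * price_a + beta[1] * price_b


beta = (1.0, -2.0)  # hypothetical cointegrating vector, not a fitted value

# Training window: (price_a, price_b) observations, made up for illustration.
training = [(100.0, 50.1), (101.0, 50.4), (102.0, 51.1), (103.0, 51.4)]
train_diseq = [diseq(a, b, beta) for a, b in training]
training_mu = mean(train_diseq)
training_std = stdev(train_diseq)  # sample standard deviation (assumption)

# A new observation from the testing set, scaled with the training statistics.
current = diseq(104.0, 51.5, beta)
scaled = (current - training_mu) / training_std
print(round(scaled, 2))
```

Here the training dis-equilibrium oscillates around zero with a standard deviation of roughly 0.23, so a new value of 1.0 scales to about 4.33 standard deviations, well beyond a typical opening threshold.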
## Customizing and Extending
- **Adding New Strategies**: Create a new class that inherits from a base strategy class (if one exists) or implements a similar interface to `StaticFitStrategy` or `SlidingFitStrategy`. The core method to implement would be `run_pair()`.
- **Modifying Data Loading**: The `tools/data_loader.py` module can be modified to support different data formats or sources.
- **Changing Cointegration/Model Parameters**: The `TradingPair` class houses the VECM fitting and cointegration checks. You can adjust parameters like `k_ar_diff` in `coint_johansen` or the `VECM` model itself.
## Important Considerations
- **Data Quality**: Ensure your market data is clean, accurate, and properly formatted. Gaps or errors in data can significantly impact backtest results.
- **Transaction Costs**: The current backtest might not explicitly model transaction costs (brokerage fees, slippage). These can have a significant impact on the profitability of high-frequency strategies. Consider adding a cost model to `BacktestResult` or within the strategy execution.
- **Look-ahead Bias**: Be extremely careful to avoid look-ahead bias. Ensure that decisions at any point in time are made using only information that would have been available at that time. The use of `training_df_` and `testing_df_` in `TradingPair` is designed to help prevent this.
- **Overfitting**: When optimizing parameters (`dis-equilibrium_open_trshld`, `training_minutes`, etc.), be mindful of overfitting to the historical data. A strategy that performs exceptionally well on past data may not perform well in the future. Use out-of-sample testing or walk-forward optimization for more robust validation.
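
One way to organize walk-forward validation is to tune parameters on a block of consecutive datafiles and evaluate them on the files that follow, then roll both blocks forward. The sketch below only produces the file splits; the function name, block sizes, and file names are hypothetical, and the tuning and evaluation steps are left to the caller.

```python
def walk_forward_splits(datafiles, n_train, n_test):
    """Yield (tuning_files, validation_files) pairs in chronological order."""
    start = 0
    while start + n_train + n_test <= len(datafiles):
        tuning = datafiles[start : start + n_train]
        validation = datafiles[start + n_train : start + n_train + n_test]
        yield tuning, validation
        start += n_test  # roll forward by the length of the validation block


days = [f"day{i:02d}.db" for i in range(1, 8)]  # placeholder file names
for tuning, validation in walk_forward_splits(days, n_train=3, n_test=2):
    print(tuning, "->", validation)
```

Validation files always come after the files used for tuning, so reported performance reflects parameters chosen without knowledge of the period being evaluated.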

This guide should provide a solid foundation for working with the pairs trading backtest system. Experiment with different configurations and strategies to find what works best for your chosen markets and instruments.