# Pairs Trading Backtest

This document provides a guide to understanding, configuring, and running the pairs trading backtest system.

## Overview

The system is designed to backtest pairs trading strategies on historical market data. It allows users to select different strategies, configure parameters, and analyze the performance of these strategies.

## Core Concepts

### Trading Pair

A trading pair consists of two financial instruments (e.g., stocks or cryptocurrencies) whose prices are believed to have a long-term statistical relationship (cointegration). The strategy aims to profit from temporary deviations from this relationship.

### Strategy

The system supports different strategies for identifying and exploiting trading opportunities. Each strategy has its own set of configurable parameters.

### Trading Signals

Trading signals indicate when to open or close a position based on the configured strategy and parameters. These signals are typically generated when the "dis-equilibrium" (the deviation from the long-term relationship) crosses certain thresholds.

## Running a Backtest

### 1. Configuration

The primary configuration for the backtest is managed in the `src/pt_backtest.py` file. Here, you will define which dataset to use (cryptocurrencies or equities) and which strategy to employ.

#### Choosing a Dataset:

You can switch between `CRYPTO_CONFIG` and `EQT_CONFIG` by uncommenting the desired configuration block:

```python
# CONFIG = CRYPTO_CONFIG  # For cryptocurrency data
CONFIG = EQT_CONFIG  # For equity data
```

Each configuration dictionary specifies:

- `data_directory`: Path to the data files.
- `datafiles`: A list of database files to process. You can comment/uncomment specific files to include/exclude them from the backtest.
- `db_table_name`: The name of the table within the SQLite database.
- `instruments`: A list of symbols to consider for forming trading pairs.
- `trading_hours`: Defines the session start and end times, crucial for equity markets.
- `stat_model_price`: The column in the data to be used as the price (e.g., "close").
- `dis-equilibrium_open_trshld`: The threshold (in standard deviations) of the dis-equilibrium for opening a trade.
- `dis-equilibrium_close_trshld`: The threshold (in standard deviations) of the dis-equilibrium for closing an open trade.
- `training_minutes`: The length of the rolling window (in minutes) used to train the model (e.g., calculate cointegration, mean, and standard deviation of the dis-equilibrium).
- `funding_per_pair`: The amount of capital allocated to each trading pair.

#### Choosing a Strategy:

The system currently offers two main strategies: `StaticFitStrategy` and `SlidingFitStrategy`. You select a strategy by instantiating it:

```python
# STRATEGY = StaticFitStrategy()
STRATEGY = SlidingFitStrategy()
```

- **`StaticFitStrategy`**: This strategy fits the cointegration model once at the beginning of each trading day (or for the entire dataset if run on a single file without a rolling window logic in the strategy itself). The parameters (mean, standard deviation of dis-equilibrium) derived from this initial fit are used for generating trading signals throughout the day.
    - **Pros**: Simpler, computationally less intensive.
    - **Cons**: May not adapt well to changing market conditions during the day.
- **`SlidingFitStrategy`**: This strategy uses a rolling window approach. The cointegration model and its parameters are re-estimated at regular intervals (defined by `training_minutes` and how the strategy implements the sliding window). This allows the strategy to adapt to evolving market dynamics.
    - **Pros**: More adaptive to changing market conditions.
    - **Cons**: Computationally more intensive. The `training_minutes` parameter is crucial here as it defines the look-back period for each re-estimation.

### 2. Parameters for Trading Signals

The key parameters that determine trading signals are primarily found within the `CONFIG` dictionaries:

- **`dis-equilibrium_open_trshld`**: This is the number of standard deviations the current dis-equilibrium must move away from its mean (calculated during the training period) to trigger an opening signal.
    - A *higher* value means the strategy will wait for a more significant deviation before entering a trade, leading to fewer but potentially more robust signals.
    - A *lower* value means the strategy will enter trades on smaller deviations, leading to more frequent signals but potentially more false positives.
- **`dis-equilibrium_close_trshld`**: This is the number of standard deviations the current dis-equilibrium must revert towards its mean (from its peak deviation) to trigger a closing signal.
    - A *higher* value (closer to the `dis-equilibrium_open_trshld`) means the strategy will close trades more quickly as the dis-equilibrium starts to revert.
    - A *lower* value (closer to zero) means the strategy will hold onto trades longer, waiting for the dis-equilibrium to revert more significantly towards the mean.
- **`training_minutes`**:
    - For `StaticFitStrategy`, this determines the initial period of data used to establish the cointegration relationship and calculate the baseline dis-equilibrium statistics for the entire trading day (or dataset portion being processed).
    - For `SlidingFitStrategy`, this defines the length of the rolling window. The model is refit using data from the most recent `training_minutes` period. A shorter window makes the strategy more responsive to recent price action but might be more prone to noise. A longer window provides a more stable model but might be slower to adapt to new trends.

### 3. Running the Script

Once the configuration is set, you can run the backtest from your terminal:

```bash
python src/pt_backtest.py
```

The script will process each datafile specified in the `CONFIG`, create all possible unique pairs from the `instruments` list, and apply the chosen strategy.

### 4. Interpreting Results

The script will output:

- Progress messages for each datafile being processed.
- A summary of trades taken.
- Grand totals of performance metrics (PnL, etc.).
- A list of any outstanding positions at the end of the backtest.

The core logic for a pair involves:

1. **Data Preparation**: For each pair, relevant price series are extracted.
2. **Training Phase** (for `SlidingFitStrategy`, this happens repeatedly; for `StaticFitStrategy`, typically once per day/file):
    * The `get_datasets()` method in `TradingPair` splits data into training and testing sets.
    * `check_cointegration()` uses the Johansen test to see if the pair's price series are cointegrated within the current training window. If not, the pair is often skipped for that window.
    * If cointegrated, `fit_VECM()` estimates a Vector Error Correction Model (VECM). The `beta` coefficients from this model define the cointegrating relationship (the "spread" or "dis-equilibrium series").
    * `training_mu_` (mean) and `training_std_` (standard deviation) of this dis-equilibrium series are calculated. These are crucial for scaling the dis-equilibrium and setting trade thresholds.
3. **Prediction/Trading Phase**:
    * The strategy iterates through the "testing" data points.
    * For each point, the current dis-equilibrium is calculated using the `beta` from the VECM.
    * This dis-equilibrium is then scaled: `(current_disequilibrium - training_mu_) / training_std_`.
    * This scaled value is compared against `dis-equilibrium_open_trshld` and `dis-equilibrium_close_trshld` to generate buy/sell/close signals.
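
The scaling and threshold comparison described in the Prediction/Trading Phase can be sketched in a few lines of plain Python. This is an illustration, not the project's actual implementation. The function names `scale` and `generate_signals`, the action labels, and the example numbers are all hypothetical. The sketch also makes three assumptions: the open threshold applies symmetrically on both sides of the mean, a position is closed once the absolute scaled value falls to or below the close threshold, and the training standard deviation is the sample standard deviation.

```python
from statistics import mean, stdev


def scale(value, training_mu, training_std):
    """Scale a raw dis-equilibrium value by the training-window statistics."""
    return (value - training_mu) / training_std


def generate_signals(training, testing, open_trshld, close_trshld):
    """Return (index, action, scaled_value) tuples for the testing points.

    `training` and `testing` are sequences of raw dis-equilibrium values.
    Only training data is used to compute the mean and standard deviation,
    so no information from the testing window leaks into the scaling.
    """
    mu = mean(training)
    sigma = stdev(training)  # assumption: sample standard deviation
    signals = []
    position = None  # None, "long_spread" or "short_spread"
    for i, value in enumerate(testing):
        z = scale(value, mu, sigma)
        if position is None:
            if z >= open_trshld:
                # Spread is unusually high: bet on it falling back.
                position = "short_spread"
                signals.append((i, "open_short_spread", z))
            elif z <= -open_trshld:
                # Spread is unusually low: bet on it rising back.
                position = "long_spread"
                signals.append((i, "open_long_spread", z))
        elif abs(z) <= close_trshld:
            # Spread has reverted close enough to the mean: exit.
            signals.append((i, "close", z))
            position = None
    return signals


# Hypothetical dis-equilibrium values, for illustration only.
training_values = [0.0, 1.0, -1.0, 0.5, -0.5]
testing_values = [0.2, 1.8, 1.0, 0.3, -2.0, -0.1]

example_signals = generate_signals(testing=testing_values,
                                   training=training_values,
                                   open_trshld=2.0,
                                   close_trshld=0.5)
for index, action, z in example_signals:
    print(index, action, round(z, 2))
```

With these example numbers the sketch opens a short-spread position at index 1, closes it at index 3, opens a long-spread position at index 4, and closes it at index 5. Raising `open_trshld` would suppress both entries, while lowering `close_trshld` would delay the exits, which matches the parameter behaviour described in "Parameters for Trading Signals".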
## Customizing and Extending

- **Adding New Strategies**: Create a new class that inherits from a base strategy class (if one exists) or implements a similar interface to `StaticFitStrategy` or `SlidingFitStrategy`. The core method to implement would be `run_pair()`.
- **Modifying Data Loading**: The `tools/data_loader.py` can be modified to support different data formats or sources.
- **Changing Cointegration/Model Parameters**: The `TradingPair` class houses the VECM fitting and cointegration checks. You can adjust parameters like `k_ar_diff` in `coint_johansen` or the `VECM` model itself.

## Important Considerations

- **Data Quality**: Ensure your market data is clean, accurate, and properly formatted. Gaps or errors in data can significantly impact backtest results.
- **Transaction Costs**: The current backtest might not explicitly model transaction costs (brokerage fees, slippage). These can have a significant impact on the profitability of high-frequency strategies. Consider adding a cost model to `BacktestResult` or within the strategy execution.
- **Look-ahead Bias**: Be extremely careful to avoid look-ahead bias. Ensure that decisions at any point in time are made using only information that would have been available at that time. The use of `training_df_` and `testing_df_` in `TradingPair` is designed to help prevent this.
- **Overfitting**: When optimizing parameters (`dis-equilibrium_open_trshld`, `training_minutes`, etc.), be mindful of overfitting to the historical data. A strategy that performs exceptionally well on past data may not perform well in the future. Use out-of-sample testing or walk-forward optimization for more robust validation.

This tutorial should provide a solid foundation for working with the pairs trading backtest system. Experiment with different configurations and strategies to find what works best for your chosen markets and instruments.