# Pairs Trading Backtest

This document provides a guide to understanding, configuring, and running the pairs trading backtest system.

## Overview

The system is designed to backtest pairs trading strategies on historical market data.
It allows users to select different strategies, configure parameters, and analyze the
performance of these strategies.

## Core Concepts

### Trading Pair
A trading pair consists of two financial instruments (e.g., stocks or cryptocurrencies)
whose prices are believed to have a long-term statistical relationship (cointegration).
The strategy aims to profit from temporary deviations from this relationship.

### Strategy
The system supports different strategies for identifying and exploiting trading opportunities. Each strategy has its own set of configurable parameters.

### Trading Signals
Trading signals indicate when to open or close a position based on the configured strategy
and parameters. These signals are typically generated when the "dis-equilibrium" (the
deviation from the long-term relationship) crosses certain thresholds.

## Running a Backtest

### 1. Configuration

The primary configuration for the backtest is managed in the `src/pt_backtest.py` file. Here, you will define which dataset to use (cryptocurrencies or equities) and which strategy to employ.

#### Choosing a Dataset:
You can switch between `CRYPTO_CONFIG` and `EQT_CONFIG` by uncommenting the desired assignment and commenting out the other:

```python
# CONFIG = CRYPTO_CONFIG  # For cryptocurrency data
CONFIG = EQT_CONFIG      # For equity data
```

Each configuration dictionary specifies:
- `data_directory`: Path to the data files.
- `datafiles`: A list of database files to process. You can comment/uncomment specific files to include/exclude them from the backtest.
- `db_table_name`: The name of the table within the SQLite database.
- `instruments`: A list of symbols to consider for forming trading pairs.
- `trading_hours`: Defines the session start and end times, crucial for equity markets.
- `stat_model_price`: The column in the data to be used as the price (e.g., "close").
- `dis-equilibrium_open_trshld`: The threshold (in standard deviations) of the dis-equilibrium for opening a trade.
- `dis-equilibrium_close_trshld`: The threshold (in standard deviations) of the dis-equilibrium for closing an open trade.
- `training_minutes`: The length of the rolling window (in minutes) used to train the model (e.g., calculate cointegration, mean, and standard deviation of the dis-equilibrium).
- `funding_per_pair`: The amount of capital allocated to each trading pair.
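
To make the shape of a configuration concrete, here is a minimal sketch of what one such dictionary might look like. Only the key names come from the list above; every value, the file and table names, and the inner layout of `trading_hours` are assumptions made for illustration and may not match the real `CRYPTO_CONFIG` or `EQT_CONFIG`.

```python
# Illustrative configuration. Key names follow this README;
# all values below are hypothetical placeholders.
EXAMPLE_CONFIG = {
    "data_directory": "./data/equity",       # assumed path
    "datafiles": [
        "day1.db",                           # assumed file name
        # "day2.db",                         # commented out: excluded from the run
    ],
    "db_table_name": "bars_1min",            # assumed table name
    "instruments": ["AAA", "BBB", "CCC"],    # placeholder symbols
    "trading_hours": {                       # assumed layout of this entry
        "begin_session": "09:30:00",
        "end_session": "16:00:00",
    },
    "stat_model_price": "close",
    "dis-equilibrium_open_trshld": 2.0,      # open at 2 standard deviations
    "dis-equilibrium_close_trshld": 0.5,     # close at 0.5 standard deviations
    "training_minutes": 120,                 # 2-hour training window
    "funding_per_pair": 2000.0,              # capital per pair
}
```

Keeping the close threshold below the open threshold, as in this sketch, leaves room between entry and exit for the spread to revert.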

#### Choosing a Strategy:
The system currently offers two main strategies: `StaticFitStrategy` and `SlidingFitStrategy`. You select a strategy by instantiating it:

```python
# STRATEGY = StaticFitStrategy()
STRATEGY = SlidingFitStrategy()
```

- **`StaticFitStrategy`**: This strategy fits the cointegration model once, at the start
  of each trading day (or once for the whole dataset when a single file is processed and
  the strategy applies no rolling window). The mean and standard deviation of the
  dis-equilibrium from that initial fit are then used to generate trading signals for
  the rest of the day.
    - **Pros**: Simpler, computationally less intensive.
    - **Cons**: May not adapt well to changing market conditions during the day.

- **`SlidingFitStrategy`**: This strategy uses a rolling window approach. The cointegration model and its parameters are re-estimated repeatedly as the window advances: `training_minutes` sets the length of the window, while the refit interval depends on how the strategy implements the sliding window. This allows the strategy to adapt to evolving market dynamics.
    - **Pros**: More adaptive to changing market conditions.
    - **Cons**: Computationally more intensive. The `training_minutes` parameter is crucial here as it defines the look-back period for each re-estimation.
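
The difference between the two strategies comes down to which rows are used for fitting and which for trading. The sketch below shows that split with plain index ranges. It assumes one data row per minute, and the function names and the `step` parameter are hypothetical; the actual strategies may slice the data differently.

```python
def static_fit_windows(n_rows, training_minutes):
    """One fit: train on the first `training_minutes` rows, trade on the rest."""
    if n_rows <= training_minutes:
        return []  # not enough data to both train and trade
    return [(range(0, training_minutes), range(training_minutes, n_rows))]


def sliding_fit_windows(n_rows, training_minutes, step=1):
    """Refit every `step` rows on the most recent `training_minutes` rows."""
    windows = []
    for start in range(training_minutes, n_rows, step):
        train = range(start - training_minutes, start)   # look-back window
        test = range(start, min(start + step, n_rows))   # rows traded with this fit
        windows.append((train, test))
    return windows


# Example: 10 rows of data with a 4-row training window.
static = static_fit_windows(10, 4)
sliding = sliding_fit_windows(10, 4, step=3)
```

In this sketch the static variant produces a single (train, test) split, whereas the sliding variant produces one split per refit, which is where its extra computational cost comes from.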

### 2. Parameters for Trading Signals

The key parameters that determine trading signals are primarily found within the `CONFIG` dictionaries:

- **`dis-equilibrium_open_trshld`**: This is the number of standard deviations the current dis-equilibrium must move away from its mean (calculated during the training period) to trigger an opening signal.
    - A *higher* value means the strategy will wait for a more significant deviation before entering a trade, leading to fewer but potentially more robust signals.
    - A *lower* value means the strategy will enter trades on smaller deviations, leading to more frequent signals but potentially more false positives.

- **`dis-equilibrium_close_trshld`**: This is the level, in standard deviations from the training mean, to which the scaled dis-equilibrium must revert to trigger a closing signal.
    - A *higher* value (closer to the `dis-equilibrium_open_trshld`) means the strategy will close trades more quickly as the dis-equilibrium starts to revert.
    - A *lower* value (closer to zero) means the strategy will hold onto trades longer, waiting for the dis-equilibrium to revert more significantly towards the mean.

- **`training_minutes`**:
    - For `StaticFitStrategy`, this determines the initial period of data used to establish the cointegration relationship and calculate the baseline dis-equilibrium statistics for the entire trading day (or dataset portion being processed).
    - For `SlidingFitStrategy`, this defines the length of the rolling window. The model is refit using data from the most recent `training_minutes` period. A shorter window makes the strategy more responsive to recent price action but might be more prone to noise. A longer window provides a more stable model but might be slower to adapt to new trends.
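
The interaction of the open and close thresholds described above can be sketched as a small state machine over a scaled dis-equilibrium series. This is an illustration only: the function name, the action labels, and the symmetric treatment of positive and negative deviations are assumptions, not the project's actual signal logic.

```python
def generate_signals(scaled, open_threshold, close_threshold):
    """Return (index, action) tuples from a scaled dis-equilibrium series."""
    signals = []
    position = 0  # 0 = flat, +1 = long the spread, -1 = short the spread
    for i, z in enumerate(scaled):
        if position == 0:
            if z >= open_threshold:
                # Spread is unusually high: bet on it falling back.
                position = -1
                signals.append((i, "open_short_spread"))
            elif z <= -open_threshold:
                # Spread is unusually low: bet on it rising back.
                position = 1
                signals.append((i, "open_long_spread"))
        elif abs(z) <= close_threshold:
            # Spread has reverted close enough to the mean: exit.
            position = 0
            signals.append((i, "close"))
    return signals


scaled_series = [0.1, 1.2, 2.3, 1.8, 0.9, 0.4, -2.5, -0.3]
signals = generate_signals(scaled_series, open_threshold=2.0, close_threshold=0.5)
```

With these inputs the sketch opens a short-spread position at index 2, closes it at index 5, opens a long-spread position at index 6, and closes it at index 7. Raising `open_threshold` above 2.5 would suppress both entries, which illustrates the trade-off between signal frequency and signal strength.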

### 3. Running the Script

Once the configuration is set, you can run the backtest from your terminal:

```bash
python src/pt_backtest.py
```

The script will process each datafile specified in the `CONFIG`, create all possible unique pairs from the `instruments` list, and apply the chosen strategy.
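
Forming all unique pairs from a list of symbols can be done with the standard library. The snippet below is a sketch of that step, not the script's own code, and the symbols are placeholders.

```python
from itertools import combinations

instruments = ["AAA", "BBB", "CCC", "DDD"]  # placeholder symbols

# Every unordered pair exactly once: (AAA, BBB) is included, (BBB, AAA) is not.
pairs = list(combinations(instruments, 2))

# The number of pairs grows quadratically: n * (n - 1) / 2.
expected_count = len(instruments) * (len(instruments) - 1) // 2
```

Because the pair count grows quadratically with the number of instruments, a long `instruments` list can make a backtest considerably slower, particularly with a strategy that refits repeatedly.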

### 4. Interpreting Results

The script will output:
- Progress messages for each datafile being processed.
- A summary of trades taken.
- Grand totals of performance metrics (PnL, etc.).
- A list of any outstanding positions at the end of the backtest.

The core logic for a pair involves:
1.  **Data Preparation**: For each pair, relevant price series are extracted.
2.  **Training Phase** (for `SlidingFitStrategy`, this happens repeatedly; for `StaticFitStrategy`, typically once per day/file):
    *   The `get_datasets()` method in `TradingPair` splits data into training and testing sets.
    *   `check_cointegration()` uses the Johansen test to see if the pair's price series are cointegrated within the current training window. If not, the pair is often skipped for that window.
    *   If cointegrated, `fit_VECM()` estimates a Vector Error Correction Model (VECM). The `beta` coefficients from this model define the cointegrating relationship (the "spread" or "dis-equilibrium series").
    *   `training_mu_` (mean) and `training_std_` (standard deviation) of this dis-equilibrium series are calculated. These are crucial for scaling the dis-equilibrium and setting trade thresholds.
3.  **Prediction/Trading Phase**:
    *   The strategy iterates through the "testing" data points.
    *   For each point, the current dis-equilibrium is calculated using the `beta` from the VECM.
    *   This dis-equilibrium is then scaled: `(current_disequilibrium - training_mu_) / training_std_`.
    *   This scaled value is compared against `dis-equilibrium_open_trshld` and `dis-equilibrium_close_trshld` to generate buy/sell/close signals.
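
The scaling step in the prediction phase can be sketched in a few lines. Here the `beta` vector is a hard-coded, hypothetical value rather than an estimate from a fitted VECM, the prices are made up, and the use of the sample standard deviation is an assumption; the project may compute these quantities differently.

```python
from statistics import mean, stdev


def disequilibrium(prices_a, prices_b, beta):
    """Linear combination of the two price series defined by beta."""
    return [beta[0] * a + beta[1] * b for a, b in zip(prices_a, prices_b)]


def scale(series, mu, sigma):
    """Express each value as a number of training standard deviations from the training mean."""
    return [(x - mu) / sigma for x in series]


beta = (1.0, -2.0)  # hypothetical cointegrating vector

# Training window: estimate the mean and standard deviation of the spread.
train_a = [100.0, 101.0, 102.0, 101.0, 100.0]
train_b = [50.0, 50.4, 51.1, 50.6, 49.9]
train_spread = disequilibrium(train_a, train_b, beta)
training_mu = mean(train_spread)
training_std = stdev(train_spread)

# Testing point: scaled using training statistics only, never its own.
test_spread = disequilibrium([103.0], [50.5], beta)
scaled = scale(test_spread, training_mu, training_std)
```

Scaling the test point with statistics from the training window alone is what keeps this step free of look-ahead bias.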

## Customizing and Extending

-   **Adding New Strategies**: Create a new class that inherits from a base strategy class (if one exists) or implements a similar interface to `StaticFitStrategy` or `SlidingFitStrategy`. The core method to implement would be `run_pair()`.
-   **Modifying Data Loading**: The `tools/data_loader.py` can be modified to support different data formats or sources.
-   **Changing Cointegration/Model Parameters**: The `TradingPair` class houses the VECM fitting and cointegration checks. You can adjust parameters like `k_ar_diff` in `coint_johansen` or the `VECM` model itself.

## Important Considerations

-   **Data Quality**: Ensure your market data is clean, accurate, and properly formatted. Gaps or errors in data can significantly impact backtest results.
-   **Transaction Costs**: The current backtest might not explicitly model transaction costs (brokerage fees, slippage). These can have a significant impact on the profitability of high-frequency strategies. Consider adding a cost model to `BacktestResult` or within the strategy execution.
-   **Look-ahead Bias**: Be extremely careful to avoid look-ahead bias. Ensure that decisions at any point in time are made using only information that would have been available at that time. The use of `training_df_` and `testing_df_` in `TradingPair` is designed to help prevent this.
-   **Overfitting**: When optimizing parameters (`dis-equilibrium_open_trshld`, `training_minutes`, etc.), be mindful of overfitting to the historical data. A strategy that performs exceptionally well on past data may not perform well in the future. Use out-of-sample testing or walk-forward optimization for more robust validation.
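
For the transaction-cost point above, a proportional cost model is one simple starting place. The sketch below is hypothetical: the function name and the fee and slippage rates are illustrative assumptions and are not part of `BacktestResult` or any strategy in this project.

```python
def net_pnl(gross_pnl, notional_traded, fee_rate=0.0005, slippage_rate=0.0002):
    """Subtract proportional fees and slippage from a trade's gross PnL.

    `notional_traded` is the total absolute value of all fills in the trade.
    """
    cost = notional_traded * (fee_rate + slippage_rate)
    return gross_pnl - cost


# A round trip in one pair: two legs of 1000 each, opened and then closed,
# gives four fills and 4000 of traded notional.
round_trip_notional = 4 * 1000.0
result = net_pnl(gross_pnl=5.0, notional_traded=round_trip_notional)
```

Under these assumed rates, costs consume more than half of the 5.0 gross profit, which shows why frequent small-edge trades are especially sensitive to cost assumptions.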

This guide should provide a solid foundation for working with the pairs trading backtest system. Experiment with different configurations and strategies to find what works best for your chosen markets and instruments.