1. Nested Cross-Validation (for GRU Hyperparameter Tuning)

**Goal:** Tune GRU hyperparameters (`gru_units`, `learning_rate`, etc.) robustly for each main walk-forward fold, using only the training data allocated to that fold. This prevents hyperparameters from being influenced by data that will later appear in the fold's validation or test set.

**Current Implementation:** The `hyperparameter_tuning.gru.sweep_enabled` flag exists, but the tuning logic isn't currently nested within the fold-processing loop in `train_or_load_gru_fold`.

**Implementation Strategy** (a hedged code sketch follows this list):

- **Modify `train_or_load_gru_fold`** (in `gru_sac_predictor/src/pipeline_stages/modelling.py`): this is the function responsible for training or loading the GRU for a specific outer walk-forward fold.
- **Check `sweep_enabled`:** Inside this function, right before the actual GRU training would normally occur (i.e., when `config['gru']['train_gru']` is true and a model isn't being loaded), check whether `config['hyperparameter_tuning']['gru']['sweep_enabled']` is also true.
- **Inner CV loop** (if the sweep is enabled):
  - **Data:** Use the `X_train_seq` and `y_train_seq_dict` passed into this function (these represent the training data for the current outer fold).
  - **Inner splits:** Use a time-series-appropriate splitter (such as `sklearn.model_selection.TimeSeriesSplit`) on the sequence indices (`train_indices_new` if returned, otherwise derived from `X_train_seq`) to create, say, 3 or 5 inner train/validation splits within the outer fold's training data.
  - **Optuna study:** Create a new Optuna study (or similar hyperparameter-optimization framework) specific to this outer fold.
  - **Objective function:** Define an Optuna objective that takes a `trial` object. It suggests hyperparameters based on `config['hyperparameter_tuning']['gru']['search_space']`, then iterates through the inner CV splits: for each inner split it instantiates a temporary `GRUModelHandler` (or just the model) with the trial's hyperparameters, trains it on the inner training slice, and evaluates it on the inner validation slice (e.g., `val_loss`). It returns the average performance (e.g., mean `val_loss`) across the inner splits.
  - **Run study:** Execute `study.optimize` with the objective function and `n_trials` from the config.
  - **Best parameters:** Retrieve `study.best_params` after optimization.
- **Final fold training:** Instantiate the `GRUModelHandler` (the `gru_handler` passed into the function), or build the GRU model, using these `best_params`. Train this single, final model for the outer fold on the entire `X_train_seq` and `y_train_seq_dict`.
- **Return:** Return this optimally tuned GRU model and handler for the outer fold to proceed.
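A minimal sketch of that nested loop, assuming Optuna and `TimeSeriesSplit`. `build_gru_model`, the single-target `y_train_seq`, and the suggested ranges below are illustrative stand-ins for `GRUModelHandler`, `y_train_seq_dict`, and the configured `search_space`; none of them are existing project code:

```python
# Hypothetical sketch: nested CV over one outer fold's training data.
import numpy as np
import optuna
import tensorflow as tf
from sklearn.model_selection import TimeSeriesSplit


def build_gru_model(n_features, gru_units, learning_rate):
    # Stand-in builder; the real pipeline would construct the model via GRUModelHandler.
    model = tf.keras.Sequential([
        tf.keras.layers.GRU(gru_units, input_shape=(None, n_features)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate), loss="mse")
    return model


def tune_gru_for_fold(X_train_seq, y_train_seq, n_trials=20, inner_cv_splits=3):
    """Run an Optuna study on inner TimeSeriesSplit folds; return best_params."""
    splitter = TimeSeriesSplit(n_splits=inner_cv_splits)

    def objective(trial):
        # Illustrative ranges; the real ones come from the configured search_space.
        gru_units = trial.suggest_int("gru_units", 16, 128, log=True)
        learning_rate = trial.suggest_float("learning_rate", 1e-4, 1e-2, log=True)

        val_losses = []
        for inner_train_idx, inner_val_idx in splitter.split(X_train_seq):
            model = build_gru_model(X_train_seq.shape[-1], gru_units, learning_rate)
            model.fit(X_train_seq[inner_train_idx], y_train_seq[inner_train_idx],
                      epochs=5, batch_size=64, verbose=0)
            val_losses.append(model.evaluate(X_train_seq[inner_val_idx],
                                             y_train_seq[inner_val_idx], verbose=0))
        # Average inner-validation loss is what the study minimises.
        return float(np.mean(val_losses))

    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=n_trials)
    return study.best_params
```

The returned `best_params` would then feed the single final GRU that is trained on the outer fold's full `X_train_seq`/`y_train_seq_dict`, as described above.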
**Configuration:**

- The existing `hyperparameter_tuning.gru` section is mostly sufficient.
- You might add a key like `inner_cv_splits: 3` to control the inner loop, as in the sketch below.
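For instance, the section might only need one new key (values illustrative, other fields as already configured):

```yaml
hyperparameter_tuning:
  gru:
    sweep_enabled: true
    n_trials: 20          # existing field; value illustrative
    inner_cv_splits: 3    # new key: number of inner TimeSeriesSplit folds
    # search_space: ...   # existing field, unchanged
```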
**Considerations:** This significantly increases computation time, since roughly `n_trials × inner_cv_splits` GRU trainings happen per outer fold (e.g., 20 trials × 3 inner splits ≈ 60 trainings, plus the final refit on the full fold training set).

2. Gap and Regime-Aware Folds

Here's a minimal "wrapper" you can drop around your existing `_generate_walk_forward_folds` to get both gap-aware **and** regime-aware filtering, without rewriting your core logic:
```python
def generate_filtered_folds(self, df, config):
    """Gap- and regime-aware wrapper; intended as a method on the class
    that already owns `_generate_walk_forward_folds`."""
    # 1) Tag regimes once, right after loading & feature-engineering the full dataset
    if config['walk_forward']['regime']['enabled']:
        df = add_regime_tags(
            df,
            indicator=config['walk_forward']['regime']['indicator'],
            window=config['walk_forward']['regime']['indicator_params']['window'],
            quantiles=config['walk_forward']['regime']['quantiles'],
        )
        min_pct = config['walk_forward']['regime']['min_regime_representation_pct']

    # 2) Split into contiguous chunks on data gaps
    chunks = split_into_contiguous_chunks(
        df,
        config['walk_forward']['gap_threshold_minutes'],
    )

    # 3) For each chunk, run the normal fold generator, then filter by regime
    for chunk_start, chunk_end in chunks:
        df_chunk = df.loc[chunk_start:chunk_end]
        # skip tiny chunks
        if (chunk_end - chunk_start).days < config['walk_forward'].get('min_chunk_days', 1):
            continue

        # your existing generator (rolling or block); it yields tuples of
        # (train_start, train_end, val_start, val_end, test_start, test_end)
        for (t0, t1, v0, v1, e0, e1) in self._generate_walk_forward_folds(df_chunk, config):
            # if regime gating is off, just yield
            if not config['walk_forward']['regime']['enabled']:
                yield (t0, t1, v0, v1, e0, e1)
                continue

            # 4) Check regime balance in each period
            periods = {
                'train': df_chunk.loc[t0:t1],
                'val': df_chunk.loc[v0:v1],
                'test': df_chunk.loc[e0:e1],
            }
            bad = False
            for name, subdf in periods.items():
                counts = subdf['regime_tag'].value_counts(normalize=True) * 100
                # ensure every regime appears ≥ min_pct in this period
                for regime in sorted(df['regime_tag'].unique()):
                    pct = counts.get(regime, 0.0)
                    if pct < min_pct:
                        bad = True
                        break
                if bad:
                    break

            if bad:
                # you can log which period/regime failed here (`name` holds the period)
                continue
            # otherwise it's a valid fold
            yield (t0, t1, v0, v1, e0, e1)
```
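A note on wiring (an assumption about your class layout): because the wrapper calls `self._generate_walk_forward_folds`, the sketch treats `generate_filtered_folds` as a method on the same splitter class, so the pipeline would iterate `self.generate_filtered_folds(df, config)` wherever it currently consumes the raw generator. The helpers `add_regime_tags` and `split_into_contiguous_chunks` are not existing code; hedged sketches of both follow the step-by-step explanation below.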
### Explanation of the steps

1. **Regime Tagging**
   - Run once, up-front: compute your volatility or trend indicator over the full series, cut it into quantile bins, and assign each row a `regime_tag` of 0/1/2 (see the helper sketches after this list).

2. **Gap Partitioning**
   - Split the DataFrame into contiguous "chunks" wherever index gaps exceed your `gap_threshold_minutes`.
   - This avoids forcing folds that straddle a hole in the data.

3. **Fold Generation (Unchanged)**
   - Call your existing `_generate_walk_forward_folds` (rolling or block) on each contiguous chunk.

4. **Regime-Balance Filter**
   - For each candidate fold, slice out the train/val/test segments, compute the fraction of each regime tag, and **skip** any fold where any regime appears below your `min_regime_representation_pct`.
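Since neither helper exists yet, here is a minimal sketch of both, assuming a `DatetimeIndex`-ed DataFrame with a `close` column and a rolling-standard-deviation-of-returns definition of the `volatility` indicator (column name, indicator definition, and warm-up handling are all assumptions, not existing project code):

```python
import pandas as pd


def add_regime_tags(df, indicator='volatility', window=20, quantiles=(0.33, 0.66)):
    """Assign each row a regime_tag (0/1/2) by quantile-binning a rolling indicator."""
    if indicator == 'volatility':
        metric = df['close'].pct_change().rolling(window).std()
    else:  # simple trend proxy as a fallback; adapt to your feature set
        metric = df['close'].pct_change(window)
    lo, hi = metric.quantile(quantiles[0]), metric.quantile(quantiles[1])
    codes = pd.cut(metric, bins=[-float('inf'), lo, hi, float('inf')], labels=False)
    df = df.copy()
    # Backfill the rolling warm-up period so every row carries a tag.
    df['regime_tag'] = pd.Series(codes, index=df.index).bfill().astype(int)
    return df


def split_into_contiguous_chunks(df, gap_threshold_minutes):
    """Yield (start, end) timestamps of segments with no index gap above the threshold."""
    gaps = df.index.to_series().diff() > pd.Timedelta(minutes=gap_threshold_minutes)
    chunk_id = gaps.cumsum()  # increments each time a gap is crossed
    for _, segment in df.groupby(chunk_id):
        yield segment.index[0], segment.index[-1]
```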
---

#### Configuration sketch
```yaml
walk_forward:
  # existing fields…
  gap_threshold_minutes: 5
  regime:
    enabled: true
    indicator: volatility
    indicator_params:
      window: 20
    quantiles: [0.33, 0.66]
    min_regime_representation_pct: 10
```
With this wrapper, you get:

- **Automatic split** at data outages > 5 min
- **Dynamic skip** of any time-slice folds that would be blind to a market regime (e.g. all high-vol or all low-vol)
- **No changes** to your core split logic — just filter its outputs.