gru_sac_predictor/prompts/walk_forward.txt
2025-04-20 17:52:49 +00:00


1. Nested Cross-Validation (for GRU Hyperparameter Tuning)
Goal: To tune GRU hyperparameters (like gru_units, learning_rate, etc.) robustly for each main walk-forward fold, using only the training data allocated to that fold. This prevents hyperparameters from being influenced by data that will later appear in the fold's validation or test set.
Current Implementation: The hyperparameter_tuning.gru.sweep_enabled flag exists, but the tuning logic isn't currently nested within the fold processing loop in train_or_load_gru_fold.
Implementation Strategy:
- Modify train_or_load_gru_fold (in gru_sac_predictor/src/pipeline_stages/modelling.py): This is the function responsible for training or loading the GRU for a specific outer walk-forward fold.
- Check sweep_enabled: Inside this function, right before the actual GRU training would normally occur (i.e., if config['gru']['train_gru'] is true and a model isn't being loaded), check whether config['hyperparameter_tuning']['gru']['sweep_enabled'] is also true.
- Inner CV Loop: If the sweep is enabled:
  - Data: Use the X_train_seq and y_train_seq_dict passed into this function (these represent the training data for the current outer fold).
  - Inner Splits: Use a time-series-appropriate splitter (like sklearn.model_selection.TimeSeriesSplit) on the sequence indices (train_indices_new if returned, otherwise derive from X_train_seq) to create, say, 3 or 5 inner train/validation splits within the outer fold's training data.
  - Optuna Study: Create a new Optuna study (or similar hyperparameter optimization framework) specific to this outer fold.
  - Objective Function: Define an Optuna objective function that takes a trial object:
    - It suggests hyperparameters based on config['hyperparameter_tuning']['gru']['search_space'].
    - It iterates through the inner CV splits. For each inner split:
      - Instantiate a temporary GRUModelHandler (or just the model) with the trial's hyperparameters.
      - Train the model on the inner training data slice.
      - Evaluate it on the inner validation data slice (e.g., calculate val_loss).
    - It returns the average performance (e.g., average val_loss) across the inner splits.
  - Run Study: Execute study.optimize with the objective function and n_trials from the config.
  - Best Parameters: Retrieve study.best_params after optimization.
- Final Fold Training: Instantiate the GRUModelHandler (gru_handler passed into the function) or build the GRU model using these best_params. Train this single, final model for the outer fold on the entire X_train_seq and y_train_seq_dict.
- Return: Return this tuned GRU model and handler so the outer fold can proceed.
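
The control flow of the inner loop can be sketched without any ML dependencies. This is a minimal illustration, not the project's implementation: plain random search stands in for the Optuna study, `time_series_splits` is a hand-rolled expanding-window splitter in the spirit of TimeSeriesSplit, and `tune_on_outer_fold`, `train_and_score`, and the dict-of-choices search space are hypothetical names and shapes chosen for the example. In the real function, `train_and_score` would build a GRU with the trial's parameters, fit it on the inner training slice, and return val_loss on the inner validation slice.

```python
import random
from statistics import mean


def time_series_splits(n_samples, n_splits):
    """Expanding-window splits: every validation block lies after its training block."""
    val_size = n_samples // (n_splits + 1)
    if val_size == 0:
        raise ValueError("not enough samples for the requested number of splits")
    first_val_start = n_samples - n_splits * val_size
    for val_start in range(first_val_start, n_samples, val_size):
        yield range(0, val_start), range(val_start, val_start + val_size)


def tune_on_outer_fold(n_samples, search_space, train_and_score,
                       n_trials=20, inner_cv_splits=3, seed=0):
    """Return (best_params, best_loss) using only the outer fold's training samples.

    Random search is a stand-in for the Optuna study described above.
    """
    rng = random.Random(seed)
    splits = list(time_series_splits(n_samples, inner_cv_splits))
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        # Assumed search-space shape: {hyperparameter name: list of candidate values}
        params = {name: rng.choice(choices) for name, choices in search_space.items()}
        # Objective: average validation loss across the inner splits
        loss = mean(train_and_score(params, tr, va) for tr, va in splits)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


def toy_score(params, train_idx, val_idx):
    # Placeholder for "fit a GRU on train_idx, return val_loss on val_idx"
    return abs(params["gru_units"] - 64) / 64 + abs(params["learning_rate"] - 1e-3)


space = {"gru_units": [32, 64, 128], "learning_rate": [1e-2, 1e-3, 1e-4]}
best_params, best_loss = tune_on_outer_fold(500, space, toy_score, n_trials=30)
```

After this step, the final model for the outer fold would be trained once on all of the fold's training data with `best_params`.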
Configuration:
- The existing hyperparameter_tuning.gru section is mostly sufficient.
- You might add a key like inner_cv_splits: 3 to control the number of inner splits.
Considerations: This significantly increases computation time: each outer fold trains n_trials * inner_cv_splits models during the sweep, plus one final fit on the full fold.
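
To get a feel for the cost, the count can be worked out directly. The numbers below (10 outer folds, 30 trials, 3 inner splits) are made up for illustration, and `models_trained_per_run` is a hypothetical helper, not part of the codebase.

```python
def models_trained_per_run(n_outer_folds, n_trials, inner_cv_splits):
    """Total number of model fits across all outer folds."""
    sweep_fits = n_trials * inner_cv_splits  # one fit per trial per inner split
    per_fold = sweep_fits + 1                # plus the final fit on the full fold
    return n_outer_folds * per_fold


print(models_trained_per_run(n_outer_folds=10, n_trials=30, inner_cv_splits=3))  # 910
```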
2. Gap and Regime-Aware Folds
Here's a minimal “wrapper” you can drop around your existing `_generate_walk_forward_folds` to get both gap-aware **and** regime-aware filtering, without rewriting your core logic:
```python
def generate_filtered_folds(self, df, config):
    # Written as a method because it calls self._original_fold_generator.
    # add_regime_tags and split_into_contiguous_chunks are helpers you supply.
    wf_cfg = config['walk_forward']
    regime_enabled = wf_cfg['regime']['enabled']

    # 1) Tag regimes once, right after loading & feature-engineering the full dataset
    if regime_enabled:
        df = add_regime_tags(
            df,
            indicator=wf_cfg['regime']['indicator'],
            window=wf_cfg['regime']['indicator_params']['window'],
            quantiles=wf_cfg['regime']['quantiles'],
        )
        min_pct = wf_cfg['regime']['min_regime_representation_pct']
        # Rows in the indicator warm-up window may be untagged (NaN); ignore them
        regimes = sorted(df['regime_tag'].dropna().unique())

    # 2) Split into contiguous chunks on data gaps
    chunks = split_into_contiguous_chunks(df, wf_cfg['gap_threshold_minutes'])

    # 3) For each chunk, run your normal fold generator, then filter by regime
    for chunk_start, chunk_end in chunks:
        # skip tiny chunks
        if (chunk_end - chunk_start).days < wf_cfg.get('min_chunk_days', 1):
            continue
        df_chunk = df.loc[chunk_start:chunk_end]

        # your existing generator (rolling or block);
        # it yields tuples of (train_start, train_end, val_start, val_end, test_start, test_end)
        for (t0, t1, v0, v1, e0, e1) in self._original_fold_generator(df_chunk, config):
            # if regime gating is off, just yield
            if not regime_enabled:
                yield (t0, t1, v0, v1, e0, e1)
                continue

            # 4) Check regime balance in each period
            periods = {
                'train': df_chunk.loc[t0:t1],
                'val': df_chunk.loc[v0:v1],
                'test': df_chunk.loc[e0:e1],
            }
            bad = False
            for name, subdf in periods.items():
                counts = subdf['regime_tag'].value_counts(normalize=True) * 100
                # ensure every regime makes up at least min_pct of the period
                if any(counts.get(regime, 0.0) < min_pct for regime in regimes):
                    bad = True
                    break
            if bad:
                # you can log which period/regime failed here
                continue

            # otherwise it's a valid fold
            yield (t0, t1, v0, v1, e0, e1)
```
### Explanation of the steps
1. **Regime Tagging**
   - Run once, up front: compute your volatility or trend indicator over the full series, cut it into quantile bins, and assign each row a `regime_tag` of 0/1/2.
2. **Gap Partitioning**
   - Split the DataFrame into contiguous “chunks” wherever index gaps exceed your `gap_threshold_minutes`.
   - This avoids forcing folds that straddle a hole in the data.
3. **Fold Generation (Unchanged)**
   - Call your existing `_generate_walk_forward_folds` (rolling or block; referenced as `_original_fold_generator` in the wrapper) on each contiguous chunk.
4. **Regime-Balance Filter**
   - For each candidate fold, slice out the train/val/test segments, compute the percentage of rows carrying each regime tag, and **skip** any fold where any regime falls below your `min_regime_representation_pct`.
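
The wrapper relies on an `add_regime_tags` helper that is not shown. As a rough sketch of the tagging step, the version below works on a plain list of returns instead of a DataFrame so that it needs only the standard library. The function names, the use of population standard deviation as the volatility indicator, and the linear-interpolation quantile rule are all assumptions made for illustration.

```python
from bisect import bisect_right
from statistics import pstdev


def rolling_volatility(returns, window):
    """Std-dev over a trailing window; None until the window is full."""
    out = []
    for i in range(len(returns)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(pstdev(returns[i + 1 - window:i + 1]))
    return out


def quantile_edges(values, quantiles):
    """Quantile thresholds via linear interpolation between sorted values."""
    s = sorted(values)
    edges = []
    for q in quantiles:
        pos = q * (len(s) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(s) - 1)
        edges.append(s[lo] + (s[hi] - s[lo]) * (pos - lo))
    return edges


def regime_tags(returns, window=20, quantiles=(0.33, 0.66)):
    """Tag each point 0 (low), 1 (mid), or 2 (high) volatility; None during warm-up."""
    vol = rolling_volatility(returns, window)
    edges = quantile_edges([v for v in vol if v is not None], quantiles)
    # bisect_right counts how many thresholds the value has reached
    return [None if v is None else bisect_right(edges, v) for v in vol]


tags = regime_tags([0, 0, 1, -1, 5, -5], window=2)
```

Note that the first `window - 1` points stay untagged, so whatever fills `regime_tag` in the DataFrame version needs a policy for those warm-up rows.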
---
#### Configuration sketch
```yaml
walk_forward:
  # existing fields…
  gap_threshold_minutes: 5
  regime:
    enabled: true
    indicator: volatility
    indicator_params:
      window: 20
    quantiles: [0.33, 0.66]
    min_regime_representation_pct: 10
```
With this wrapper, you get:
- **Automatic split** at data outages > 5 min
- **Dynamic skip** of any time-slice folds that would be blind to a market regime (e.g. all high-vol or all low-vol)
- **No changes** to your core split logic; you only filter its outputs.
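
The gap split itself is also left to a helper, `split_into_contiguous_chunks`, that the wrapper does not define. One possible shape for it is sketched below under some assumptions: it takes the sorted timestamps (for example `list(df.index)`) instead of the DataFrame, and the name `split_timestamps_into_chunks` is hypothetical. It returns the `(chunk_start, chunk_end)` pairs the wrapper iterates over.

```python
from datetime import datetime, timedelta


def split_timestamps_into_chunks(timestamps, gap_threshold_minutes):
    """Return (start, end) pairs for runs with no gap larger than the threshold.

    Assumes `timestamps` is sorted in ascending order.
    """
    if not timestamps:
        return []
    max_gap = timedelta(minutes=gap_threshold_minutes)
    chunks = []
    start = prev = timestamps[0]
    for ts in timestamps[1:]:
        if ts - prev > max_gap:
            # Gap too large: close the current chunk and open a new one
            chunks.append((start, prev))
            start = ts
        prev = ts
    chunks.append((start, prev))
    return chunks


base = datetime(2025, 1, 1, 9, 0)
stamps = [base + timedelta(minutes=m) for m in (0, 1, 2, 10, 11)]
chunks = split_timestamps_into_chunks(stamps, gap_threshold_minutes=5)
```

Here the 8-minute hole between minute 2 and minute 10 exceeds the 5-minute threshold, so the series splits into two chunks.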