## 1. Nested Cross-Validation (for GRU Hyperparameter Tuning)

**Goal:** Tune GRU hyperparameters (such as `gru_units`, `learning_rate`, etc.) robustly for each main walk-forward fold, using only the training data allocated to that fold. This prevents hyperparameters from being influenced by data that will later appear in the fold's validation or test set.

**Current implementation:** The `hyperparameter_tuning.gru.sweep_enabled` flag exists, but the tuning logic isn't currently nested within the fold-processing loop in `train_or_load_gru_fold`.

**Implementation strategy:** Modify `train_or_load_gru_fold` (in `gru_sac_predictor/src/pipeline_stages/modelling.py`), the function responsible for training or loading the GRU for a specific outer walk-forward fold:

1. **Check `sweep_enabled`:** Inside this function, right before the actual GRU training would normally occur (i.e., if `config['gru']['train_gru']` is true and a model isn't being loaded), check whether `config['hyperparameter_tuning']['gru']['sweep_enabled']` is also true.
2. **Inner CV loop (if the sweep is enabled):**
   - **Data:** Use the `X_train_seq` and `y_train_seq_dict` passed into this function (these represent the training data for the current outer fold).
   - **Inner splits:** Use a time-series-appropriate splitter (such as `sklearn.model_selection.TimeSeriesSplit`) on the sequence indices (`train_indices_new` if returned, otherwise derived from `X_train_seq`) to create, say, 3 or 5 inner train/validation splits within the outer fold's training data.
   - **Optuna study:** Create a new Optuna study (or use a similar hyperparameter-optimization framework) specific to this outer fold.
   - **Objective function:** Define an Optuna objective function that takes a `trial` object:
     - It suggests hyperparameters based on `config['hyperparameter_tuning']['gru']['search_space']`.
     - It iterates through the inner CV splits. For each inner split:
       - Instantiate a temporary `GRUModelHandler` (or just the model) with the trial's hyperparameters.
       - Train the model on the inner training data slice.
       - Evaluate it on the inner validation data slice (e.g., calculate `val_loss`).
     - Return the average performance (e.g., average `val_loss`) across the inner splits.
   - **Run the study:** Execute `study.optimize` with the objective function and `n_trials` from the config.
   - **Best parameters:** Retrieve `study.best_params` after optimization.
3. **Final fold training:** Instantiate the `GRUModelHandler` (the `gru_handler` passed into the function) or build the GRU model using these `best_params`. Train this single, final model for the outer fold on the entire `X_train_seq` and `y_train_seq_dict`.
4. **Return:** Return this optimally tuned GRU model and handler for the outer fold to proceed.

**Configuration:** The existing `hyperparameter_tuning.gru` section is mostly sufficient. You might add a key like `inner_cv_splits: 3` to control the inner loop.

**Considerations:** This significantly increases computation time, since `n_trials * inner_cv_splits` models are trained per outer fold.

## 2. Gap- and Regime-Aware Folds

Here’s a minimal “wrapper” you can drop around your existing `_generate_walk_forward_folds` to get both gap‑aware **and** regime‑aware filtering, without rewriting your core logic (note: since it calls `self._original_fold_generator`, it belongs on the same class as your fold generator, so it takes `self`):

```python
def generate_filtered_folds(self, df, config):
    # 1) Tag regimes once, right after loading & feature-engineering the full dataset
    if config['walk_forward']['regime']['enabled']:
        df = add_regime_tags(
            df,
            indicator=config['walk_forward']['regime']['indicator'],
            window=config['walk_forward']['regime']['indicator_params']['window'],
            quantiles=config['walk_forward']['regime']['quantiles'],
        )
        min_pct = config['walk_forward']['regime']['min_regime_representation_pct']

    # 2) Split into contiguous chunks on data gaps
    chunks = split_into_contiguous_chunks(
        df, config['walk_forward']['gap_threshold_minutes']
    )

    # 3) For each chunk, run your normal fold generator, then filter by regime
    for chunk_start, chunk_end in chunks:
        df_chunk = df.loc[chunk_start:chunk_end]

        # skip tiny chunks
        if (chunk_end - chunk_start).days < config['walk_forward'].get('min_chunk_days', 1):
            continue

        # your existing generator (rolling or block);
        # it yields tuples of (train_start, train_end, val_start, val_end, test_start, test_end)
        for (t0, t1, v0, v1, e0, e1) in self._original_fold_generator(df_chunk, config):
            # if regime gating is off, just yield
            if not config['walk_forward']['regime']['enabled']:
                yield (t0, t1, v0, v1, e0, e1)
                continue

            # 4) Check regime balance in each period
            periods = {
                'train': df_chunk.loc[t0:t1],
                'val': df_chunk.loc[v0:v1],
                'test': df_chunk.loc[e0:e1],
            }
            bad = False
            for name, subdf in periods.items():
                counts = subdf['regime_tag'].value_counts(normalize=True) * 100
                # ensure every regime appears >= min_pct
                for regime in sorted(df['regime_tag'].unique()):
                    pct = counts.get(regime, 0.0)
                    if pct < min_pct:
                        bad = True
                        break
                if bad:
                    break
            if bad:
                # you can log which period/regime failed here
                continue

            # otherwise it's a valid fold
            yield (t0, t1, v0, v1, e0, e1)
```

### Explanation of the steps

1. **Regime tagging**
   - Run once, up‑front: compute your volatility or trend indicator over the full series, cut it into quantile bins, and assign each row a `regime_tag` of 0/1/2.
2. **Gap partitioning**
   - Split the DataFrame into contiguous “chunks” wherever index gaps exceed your `gap_threshold_minutes`.
   - This avoids forcing folds that straddle a hole in the data.
3. **Fold generation (unchanged)**
   - Call your existing `_generate_walk_forward_folds` (rolling or block) on each contiguous chunk.
4. **Regime‑balance filter**
   - For each candidate fold, slice out the train/val/test segments, compute the fraction of each regime tag, and **skip** any fold where any regime appears below your `min_regime_representation_pct`.
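The wrapper above assumes two helpers that aren't defined anywhere in this note. A minimal sketch of what they might look like, assuming a pandas DataFrame with a `DatetimeIndex` and a `close` price column (the names `add_regime_tags` and `split_into_contiguous_chunks` come from the wrapper; everything inside them is illustrative):

```python
import numpy as np
import pandas as pd

def add_regime_tags(df, indicator='volatility', window=20, quantiles=(0.33, 0.66)):
    """Assign each row a regime_tag of 0/1/2 by cutting an indicator into quantile bins."""
    if indicator == 'volatility':
        # rolling std of simple returns as a basic volatility proxy (assumes a 'close' column)
        series = df['close'].pct_change().rolling(window).std()
    else:
        raise ValueError(f"unknown indicator: {indicator}")
    edges = series.quantile(list(quantiles)).values
    df = df.copy()
    # 0 = low, 1 = mid, 2 = high; warm-up rows (NaN indicator) stay untagged (NaN)
    df['regime_tag'] = pd.cut(
        series, bins=[-np.inf, edges[0], edges[1], np.inf], labels=[0, 1, 2]
    )
    return df

def split_into_contiguous_chunks(df, gap_threshold_minutes):
    """Yield (start, end) index pairs for runs with no gap larger than the threshold."""
    gaps = df.index.to_series().diff() > pd.Timedelta(minutes=gap_threshold_minutes)
    chunk_id = gaps.cumsum()  # increments at every gap, labelling each contiguous run
    for _, grp in df.groupby(chunk_id):
        yield grp.index[0], grp.index[-1]
```

One caveat on the tagging sketch: computing the quantile edges over the *full* series leaks a small amount of future information into the regime labels; if that matters for your setup, compute the edges on each fold's training slice instead.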
---

#### Configuration sketch

```yaml
walk_forward:
  # existing fields…
  gap_threshold_minutes: 5
  regime:
    enabled: true
    indicator: volatility
    indicator_params:
      window: 20
    quantiles: [0.33, 0.66]
    min_regime_representation_pct: 10
```

With this wrapper, you get:

- **Automatic splitting** at data outages > 5 min
- **Dynamic skipping** of any time‑slice folds that would be blind to a market regime (e.g. all high‑vol or all low‑vol)
- **No changes** to your core split logic; you just filter its outputs.
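Returning to the nested cross-validation idea in section 1: the inner tuning loop can be sketched framework-agnostically. `tune_inner_cv` and `train_and_eval` are hypothetical names, and simple random sampling stands in for Optuna here; with Optuna, the body of `objective` would become the study's objective function and `trial.suggest_*` calls would replace the sampling, with `study.optimize(objective, n_trials=...)` driving the loop.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

def tune_inner_cv(X_train_seq, y_train_seq, search_space, train_and_eval,
                  n_trials=10, inner_cv_splits=3, seed=0):
    """Random search over `search_space`, scored by the average validation loss
    across time-ordered inner splits of the outer fold's training data.

    `train_and_eval(params, X_tr, y_tr, X_va, y_va)` is a caller-supplied hook
    that builds/trains a model with `params` and returns its validation loss.
    """
    rng = np.random.default_rng(seed)
    tscv = TimeSeriesSplit(n_splits=inner_cv_splits)

    def objective(params):
        losses = []
        for tr_idx, va_idx in tscv.split(X_train_seq):
            # train on the inner-train slice, score on the inner-val slice
            losses.append(train_and_eval(params,
                                         X_train_seq[tr_idx], y_train_seq[tr_idx],
                                         X_train_seq[va_idx], y_train_seq[va_idx]))
        return float(np.mean(losses))

    best_params, best_loss = None, np.inf
    for _ in range(n_trials):
        # with Optuna this sampling becomes trial.suggest_* calls
        params = {k: rng.choice(v).item() for k, v in search_space.items()}
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss
```

After this returns, the outer fold would retrain a single final model with `best_params` on the entire `X_train_seq` / `y_train_seq_dict`, exactly as described in step 3 of section 1.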