Double Descent in Financial Time Series
Nicholas Wong
Final project for 6.7960 Deep Learning, MIT
Outline

Introduction

Background and related work

Methods and experiments

Results

Conclusion and limitations

References

Introduction

Understanding generalization in modern deep learning systems remains an important open problem. Over the past several years, a growing body of empirical work has documented a surprising pattern: increasing model complexity can improve test performance even after models fully interpolate the training data. This behavior, commonly referred to as the double descent phenomenon, has been observed in linear models, random feature models, and deep neural networks trained on computer vision and language tasks.

The typical pattern is as follows. Test error decreases as model size grows, consistent with classical statistical learning theory. However, near the interpolation threshold where the model has just enough capacity to fit all training examples, test error increases sharply. Beyond this point, as model size continues to grow, test error decreases again in the overparameterized regime. This contradicts decades of statistical wisdom about the bias-variance tradeoff, which predicts that overparameterized models should overfit and generalize poorly.

Financial time series present a markedly different setting. Asset returns exhibit extremely low signal-to-noise ratios, with even the best predictive models explaining only 5–10% of variance. The data-generating process is non-stationary, characterized by regime shifts, structural breaks, and time-varying conditional distributions. Unlike benchmark datasets in computer vision where labels are essentially deterministic given the input (a photo of a cat is unambiguously a cat), next-period returns are fundamentally noisy realizations rather than deterministic functions of observable features.

If double descent requires sufficient signal strength and stable data structure to manifest, we should not expect to observe it in financial prediction tasks. This raises a practical question for quantitative practitioners: do the benefits of overparameterization observed in high-signal domains extend to low-signal environments, or does classical regularization intuition remain the appropriate modeling approach?

Existing work has characterized double descent on clean benchmark datasets and has reported its absence in financial prediction tasks, but these strands of research have largely been studied in isolation. Prior studies do not systematically vary signal-to-noise ratio while holding the data-generating process fixed, nor do they directly compare synthetic and real financial data under a shared experimental protocol. This project addresses that gap by combining controlled synthetic experiments with a matched financial study in order to identify when the interpolation regime is beneficial and when it fails in practice.

Research hypotheses

H1 (Validation): Double descent will be observable in synthetic time series when the signal-to-noise ratio is sufficiently high and the data-generating process is stationary and learnable.

H2 (Mechanism): The phenomenon will attenuate and eventually disappear as the signal-to-noise ratio decreases through systematic label noise manipulation.

H3 (Main result): Real S&P 500 returns, characterized by inherently low signal-to-noise ratios and non-stationarity, will exhibit flat or absent double descent patterns even in highly overparameterized models.

To evaluate these hypotheses, we design experiments that systematically vary signal-to-noise ratio in controlled synthetic data (5 noise levels, 8 model sizes, 5 random seeds, yielding 200 training runs) and compare these results with analogous experiments on 10 years of S&P 500 constituent data. The experimental design enables a direct assessment of how noise, structure, and model capacity interact to determine generalization behavior.

Our findings strongly support all three hypotheses. Double descent requires sufficient signal-to-noise ratio to manifest. When irreducible noise dominates signal, as it does in financial return prediction, the characteristic interpolation peak flattens or vanishes entirely. This has immediate implications for modeling practice: in low-signal-to-noise domains, the "bigger is better" intuition from modern deep learning does not apply, and classical model selection based on validation performance remains essential.

The remainder of this work proceeds as follows. We first reproduce double descent on clean synthetic data to validate our experimental pipeline. We then systematically degrade signal quality to characterize how the phenomenon attenuates with increasing noise. Finally, we test whether real financial data exhibits any trace of double descent and interpret the results in light of the theoretical framework established by prior work. The empirical patterns observed provide clear evidence that double descent is contingent on data characteristics and offer practical guidance for model selection in low-signal domains.

The central research question: does the modern regime of beneficial overparameterization extend to domains where signal is scarce and noise dominates?

Background and related work

Empirical characterization of double descent

Classical learning theory predicts a U-shaped test error curve as model complexity increases. Recent empirical work has shown that many modern models exhibit a different pattern. Belkin et al. demonstrated that test error can rise sharply at the interpolation threshold and then decrease again in the overparameterized regime [1]. Nakkiran et al. expanded this result and showed that double descent appears not only with model size but also with training time and dataset size [2]. These studies establish that overparameterization can improve generalization under certain conditions.

Theoretical analysis and conditions for benign overfitting

A complementary line of work provides theoretical explanations for when interpolation can generalize. Hastie et al. analyzed ridgeless regression and showed that risk can diverge near the interpolation threshold but decrease again for sufficiently large models [3]. Bartlett et al. characterized conditions under which interpolating solutions remain predictive, emphasizing the roles of effective rank and alignment between signal and noise [4]. Mei and Montanari derived precise asymptotics for random feature regression that highlight the dependence of test error on model width, data covariance structure, and signal strength [5]. Collectively, these results show that double descent requires both sufficient signal and a favorable geometric relationship between the data distribution and the model class.
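For intuition, the simplest case analyzed in [3] already captures this dependence. For an isotropic, well-specified linear model with p features and n samples, writing γ = p/n for the overparameterization ratio, r² = ‖β‖² for signal strength, and σ² for the label-noise variance, the asymptotic excess prediction risk of ridgeless (minimum-norm) least squares is, restated here in our notation,

R(\gamma) \;=\;
\begin{cases}
\sigma^{2}\,\dfrac{\gamma}{1-\gamma}, & \gamma < 1, \\[1.5ex]
r^{2}\!\left(1-\dfrac{1}{\gamma}\right) + \dfrac{\sigma^{2}}{\gamma-1}, & \gamma > 1.
\end{cases}

The risk diverges as γ → 1 (the interpolation threshold) and falls back toward the null risk r² as γ grows. How deep the second descent is, and whether it can ever undercut the best underparameterized model, depends on the strength of the signal relative to σ² and on the richer covariance and misspecification structure studied in [3] and [5]; this ratio of signal to noise is exactly the quantity our experiments manipulate.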

Variance decomposition and the interpolation peak

Adlam and Pennington decomposed test error into contributions from label noise, sampling variation, and initialization, and showed that their interaction explains the sharp interpolation peak [6]. Their findings suggest that double descent is not driven by noise alone but arises from how model capacity interacts with the geometry of the data. When these conditions fail, the peak can weaken or disappear.

Double descent in time series and financial domains

Recent work has begun to examine double descent in sequential settings. Assandri et al. documented epoch-wise double descent in Transformer models trained on long-horizon forecasting tasks when temporal structure is strong and the underlying signal is rich enough to support complex sequence models [7]. Their analysis focuses on training dynamics and illustrates that, under favorable conditions, additional training and capacity can eventually improve generalization even after an apparent overfitting phase.

In contrast, studies of financial prediction have reported a very different picture. Noguer i Alonso and Srivastava evaluated a variety of machine learning models on equity index prediction tasks and found no evidence of double descent in test performance curves, even though training error decreased with model complexity [8]. They attribute this absence to low signal-to-noise ratios and structural non-stationarity in financial markets, but do not systematically vary signal strength or directly compare financial data to clean synthetic benchmarks under a shared experimental protocol.

These works together suggest that double descent is sensitive to domain characteristics, but they leave several questions open. In particular, prior studies do not isolate the role of signal-to-noise ratio through controlled experiments, nor do they quantify how far financial prediction lies from the regimes where double descent is known to occur. Our project addresses this gap by combining controlled synthetic experiments with a matched financial study. By systematically degrading signal in a setting where double descent is known to appear, and by directly comparing these results to S&P 500 returns under an identical modeling pipeline, we expose a fundamental limitation of overparameterized deep models in low-signal, non-stationary environments and clarify when the modern interpolation regime is relevant for time series forecasting and when it fails to hold.

Theoretical work predicts that double descent requires sufficient signal structure. Financial data exhibits extremely low signal-to-noise ratios. This motivates our experimental design.

Methods and experiments

This section describes the data sources, model architectures, and experimental protocols used to evaluate the three research hypotheses. All experiments follow a common structure. We validate the presence of double descent in a clean synthetic setting, vary signal-to-noise ratio by adding controlled label noise, and then apply the same modeling pipeline to S&P 500 equity returns to assess whether the phenomenon persists in real financial data.

Data

Synthetic data with controlled signal-to-noise ratios

The synthetic dataset provides a stable, learnable nonlinear relationship with controlled noise levels. We generate feature vectors x ∈ ℝ²⁰ by sampling independently from a standard normal distribution, x ~ N(0, I). Targets are obtained by applying a smooth nonlinear function to x and then adding independent Gaussian label noise with standard deviation σ ∈ {0, 0.5, 1.0, 2.0, 4.0}. Specifically, y = f(x) + σε, where f is a deterministic nonlinear mapping and ε ~ N(0,1). When σ = 0 the mapping is deterministic; for larger σ values, noise increasingly dominates and the effective signal-to-noise ratio decreases.

For each noise level we generate 5,000 samples and apply a 70/15/15 percent train/validation/test split. Synthetic data serves two purposes: it verifies that the implementation reproduces classical double descent when signal is strong, and it isolates the effect of label noise on generalization without confounding by temporal dependencies or non-stationarity.
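As a concrete illustration, a minimal sketch of this generation procedure is given below (Python/NumPy). The particular nonlinearity f is a hypothetical stand-in, since only its smoothness and determinism are specified above; the sample count, noise levels, and 70/15/15 split follow the description.

import numpy as np

def make_synthetic(n=5000, d=20, sigma=0.0, seed=0):
    """Generate (X, y) with y = f(x) + sigma * eps, eps ~ N(0, 1).

    The f below is an illustrative smooth nonlinear map; the report only
    specifies that f is deterministic and nonlinear, so treat this
    particular choice as a placeholder.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))                  # x ~ N(0, I_20)
    w = rng.standard_normal(d) / np.sqrt(d)          # fixed random direction
    f = np.sin(X @ w) + 0.5 * (X[:, 0] * X[:, 1])    # smooth nonlinear signal
    y = f + sigma * rng.standard_normal(n)           # additive label noise
    return X, y

def split_70_15_15(X, y, seed=0):
    """Random 70/15/15 train/validation/test split."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_tr, n_va = int(0.70 * len(y)), int(0.15 * len(y))
    tr, va, te = idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])

# One dataset per noise level used in the experiments.
datasets = {s: split_70_15_15(*make_synthetic(sigma=s)) for s in [0, 0.5, 1.0, 2.0, 4.0]}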

S&P 500 financial data

The financial dataset was constructed from scratch using daily price and volume data for S&P 500 constituents downloaded via the Yahoo Finance API (yfinance). We selected approximately 50 liquid, large-cap stocks spanning 2014 to 2024 and engineered a feature set designed to capture common technical trading signals. For each stock and date, we compute lagged returns (1, 5, and 20 day), 20-day rolling volatility, 20-day moving average of volume, and market return. Features are standardized cross-sectionally within each date to account for time-varying market conditions. The prediction target is the next-day return.
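A sketch of the feature construction (Python/pandas) is shown below. It assumes a long-format frame with columns date, ticker, close, and volume assembled from yfinance downloads; the column names and the exact standardization details are illustrative assumptions, while the feature list itself mirrors the description above.

import pandas as pd

def build_features(prices: pd.DataFrame) -> pd.DataFrame:
    """Build the per-stock, per-date feature panel with next-day return target.

    Assumes a long-format frame with columns date, ticker, close, volume
    (e.g., assembled from yfinance downloads); column names are illustrative.
    """
    df = prices.sort_values(["ticker", "date"]).copy()
    by_ticker = lambda col: df.groupby("ticker")[col]

    df["ret_1d"] = by_ticker("close").pct_change(1)          # lagged returns
    df["ret_5d"] = by_ticker("close").pct_change(5)
    df["ret_20d"] = by_ticker("close").pct_change(20)
    df["vol_20d"] = by_ticker("ret_1d").rolling(20).std().reset_index(level=0, drop=True)
    df["volu_20d"] = by_ticker("volume").rolling(20).mean().reset_index(level=0, drop=True)
    df["mkt_ret"] = df.groupby("date")["ret_1d"].transform("mean")   # equal-weight market return
    df["target"] = by_ticker("ret_1d").shift(-1)             # next-day return

    # Standardize features cross-sectionally within each date.
    feats = ["ret_1d", "ret_5d", "ret_20d", "vol_20d", "volu_20d", "mkt_ret"]
    df[feats] = df.groupby("date")[feats].transform(lambda x: (x - x.mean()) / (x.std() + 1e-12))
    return df.dropna(subset=feats + ["target"])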

We use a chronological split of 2014–2021 for training, 2022 for validation, and 2023–2024 for testing. This produces a sample of roughly 50,000 stock-day observations for model fitting. Unlike the synthetic data, financial series are non-stationary and exhibit extremely low signal-to-noise ratio, making them a realistic stress test for generalization behavior.
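The chronological split is then a simple filter on calendar year; a minimal sketch, assuming the panel produced by the feature sketch above:

import pandas as pd

def chronological_split(panel: pd.DataFrame):
    """Split the stock-day panel by year: 2014-2021 train, 2022 validation, 2023-2024 test."""
    year = pd.to_datetime(panel["date"]).dt.year
    train = panel[year <= 2021]
    val = panel[year == 2022]
    test = panel[(year >= 2023) & (year <= 2024)]
    return train, val, test

Splitting strictly by date ensures that every test observation postdates every training observation, avoiding look-ahead leakage.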

Table 1 reports baseline performance for neural networks, ridge regression, and the naive mean predictor across synthetic and financial settings.

Dataset            Method             Configuration      Test MSE     Test R²
Synthetic (σ=0)    Neural Network     Width 16           0.025014     0.9875
Synthetic (σ=0)    Ridge Regression   20 parameters      0.257353     0.8718
Synthetic (σ=4)    Neural Network     Width 8            17.944953    -0.0343
Synthetic (σ=4)    Ridge Regression   20 parameters      15.902617    0.0834
S&P 500            Neural Network     Width 512          0.000286     -0.0020
S&P 500            Ridge Regression   7 parameters       0.000288     -0.0082
S&P 500            Naive Baseline     Mean prediction    0.000286     0.0000

Model architectures and training

All neural network models are fully connected multilayer perceptrons with ReLU activations. Depth is fixed at two hidden layers, and width varies across {8, 16, 32, 64, 128, 256, 512, 1024}, yielding models from roughly 10² to 10⁶ parameters and ensuring the interpolation threshold is crossed. The output dimension is one for regression.
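A minimal PyTorch sketch of this architecture follows; the class and variable names are our own, but the structure (two equal-width ReLU hidden layers and a scalar output) matches the description above. Here in_dim is 20 for the synthetic inputs and equals the number of engineered features for the financial panel.

import torch.nn as nn

WIDTHS = [8, 16, 32, 64, 128, 256, 512, 1024]

class MLP(nn.Module):
    """Two hidden layers of equal width, ReLU activations, scalar regression output."""
    def __init__(self, in_dim: int, width: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

# Parameter counts range from a few hundred (width 8) to roughly 10^6 (width 1024).
models = {w: MLP(in_dim=20, width=w) for w in WIDTHS}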

Models are trained using the Adam optimizer with learning rate 0.001. For synthetic data we use full-batch gradient descent; for financial data we use mini-batches of size 256. Training continues until loss stabilizes for 50 epochs. No explicit regularization (dropout, weight decay, early stopping) is applied; this isolates the effect of implicit regularization and allows direct observation of the interpolation peak.
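A sketch of the corresponding training loop (Adam, learning rate 0.001, MSE loss, no explicit regularization). The stopping rule below, which halts once the training loss has not improved for 50 consecutive epochs, is our reading of "loss stabilizes for 50 epochs"; the patience bookkeeping and the maximum epoch cap are assumptions.

import numpy as np
import torch
import torch.nn.functional as F

def train(model, X, y, batch_size=None, lr=1e-3, patience=50, max_epochs=5000):
    """Train with Adam and MSE loss until the training loss stops improving.

    batch_size=None -> full-batch gradient descent (synthetic data);
    batch_size=256  -> mini-batches (financial data). No dropout, weight
    decay, or early stopping on validation loss is applied.
    """
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    X = torch.as_tensor(np.asarray(X, dtype=np.float32))
    y = torch.as_tensor(np.asarray(y, dtype=np.float32))
    best, stale = float("inf"), 0
    for epoch in range(max_epochs):
        perm = torch.randperm(len(y)) if batch_size else torch.arange(len(y))
        bs = batch_size or len(y)
        epoch_loss = 0.0
        for i in range(0, len(y), bs):
            idx = perm[i:i + bs]
            opt.zero_grad()
            loss = F.mse_loss(model(X[idx]), y[idx])
            loss.backward()
            opt.step()
            epoch_loss += loss.item() * len(idx)
        epoch_loss /= len(y)
        # Stop once the training loss has not improved for `patience` epochs.
        if epoch_loss < best - 1e-8:
            best, stale = epoch_loss, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return model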

We report mean squared error and out-of-sample R². Ridge regression is included as a linear benchmark with cross-validated regularization strength. Table 2 summarizes interpolation peak statistics for each noise level and for the financial dataset.

Dataset              Peak Width    Peak MSE      Best MSE      Peak/Best Ratio    % Worse
Synthetic (σ=0)      128           0.219101      0.025014      8.76×              775.9%
Synthetic (σ=0.5)    64            0.948106      0.386650      2.45×              145.2%
Synthetic (σ=1.0)    64            3.151716      1.288296      2.45×              144.6%
Synthetic (σ=2.0)    64            11.237464     4.650125      2.42×              141.7%
Synthetic (σ=4.0)    64            43.852972     17.944953     2.44×              144.4%
S&P 500              64            0.000312      0.000286      1.09×              9.1%
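For completeness, a minimal sketch of the evaluation metrics and the linear benchmark (Python/scikit-learn). Out-of-sample R² is computed here against the training-set mean, a common convention in return forecasting (whether the training or test mean is used is not stated above); under this convention the naive mean predictor scores exactly zero and any model with higher test MSE scores negative. The alpha grid for ridge is an assumption.

import numpy as np
from sklearn.linear_model import RidgeCV

def oos_r2(y_true, y_pred, y_train_mean):
    """Out-of-sample R^2 relative to a constant mean forecast."""
    mse_model = np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)
    mse_naive = np.mean((np.asarray(y_true) - y_train_mean) ** 2)
    return 1.0 - mse_model / mse_naive

# Linear benchmark: ridge regression with cross-validated regularization strength.
ridge = RidgeCV(alphas=np.logspace(-4, 4, 25))

# The linear-benchmark rows of Table 1 would be produced along these lines:
# ridge.fit(X_train, y_train)
# r2 = oos_r2(y_test, ridge.predict(X_test), y_train.mean())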

Experimental protocols

We conduct three experiments corresponding to the three research hypotheses. All experiments are repeated over five random seeds.

Experiment 1: Synthetic high signal-to-noise ratio (H1)

For σ = 0 we train models at all widths and record training loss, test mean squared error, and R². This experiment verifies that the implementation reproduces the classical double descent curve under ideal conditions.

Experiment 2: Noise sweep (H2)

To quantify the effect of decreasing signal quality, we repeat the width sweep for each value of σ. We also track the performance of three representative models: a simple model (width 8), a threshold model (width 64), and a complex model (width 1024). This allows us to analyze how robustness and generalization change as noise increases.
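The noise sweep itself is a triple loop over noise level, width, and seed (5 × 8 × 5 = 200 runs). A sketch of the protocol, reusing the helper functions defined in the earlier sketches; in this sketch the seed controls both data sampling and network initialization, an implementation detail not pinned down above.

import itertools
import numpy as np
import torch

SIGMAS = [0.0, 0.5, 1.0, 2.0, 4.0]
WIDTHS = [8, 16, 32, 64, 128, 256, 512, 1024]
SEEDS = range(5)

results = []  # 5 noise levels x 8 widths x 5 seeds = 200 training runs
for sigma, width, seed in itertools.product(SIGMAS, WIDTHS, SEEDS):
    (Xtr, ytr), _, (Xte, yte) = split_70_15_15(*make_synthetic(sigma=sigma, seed=seed), seed=seed)
    torch.manual_seed(seed)
    model = train(MLP(in_dim=20, width=width), Xtr, ytr)   # full batch on synthetic data
    preds = model(torch.as_tensor(Xte, dtype=torch.float32)).detach().numpy()
    results.append({
        "sigma": sigma, "width": width, "seed": seed,
        "test_mse": float(np.mean((yte - preds) ** 2)),
        "test_r2": float(oos_r2(yte, preds, ytr.mean())),
    })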

Experiment 3: Financial prediction (H3)

Finally, we apply the same modeling protocol to S&P 500 returns. By holding the architecture and training procedure fixed across datasets, we can directly compare generalization patterns in clean synthetic data, noisy synthetic data, and real financial data.

The methods keep the modeling pipeline identical across synthetic and financial data. This alignment makes it possible to attribute differences in behavior primarily to changes in signal-to-noise ratio rather than to differences in architecture or training protocol.

Results

This section evaluates the three research hypotheses using the experimental protocols described above. We first confirm that the implementation reproduces double descent in a clean synthetic setting. We then examine how this behavior changes as noise increases and compare these results with the behavior observed in S&P 500 returns.

Experiment 1: Synthetic high signal-to-noise ratio

Figure 2 shows test mean squared error as a function of model width for the synthetic dataset with σ = 0. The curve exhibits the characteristic double descent shape. Test error decreases with width, rises sharply near the interpolation threshold, and decreases again for large models. The interpolation peak is substantial: the worst model (width 128) performs 8.8 times worse than the best model (width 16). At large widths, the model recovers low test error, indicating that implicit regularization guides overparameterized networks toward solutions that generalize well when the mapping is fully learnable.

Figure 2. Test MSE versus model width for synthetic data with σ = 0, showing a clear double descent curve.

Experiment 2: Effect of decreasing signal-to-noise ratio

We next examine how label noise alters generalization. Figure 3 reveals a surprising pattern: simple models (width 8) consistently outperform complex models (width 1024) across all noise levels, with no crossover point observed. This holds even at σ = 0, where neural networks achieve excellent performance (R² = 0.99); at σ = 4 the simple model achieves 34% lower error than the complex one. Notably, the best-performing width shifts from 16 at σ = 0 to 8 at all higher noise levels, indicating that optimal model complexity decreases as the signal-to-noise ratio degrades. The threshold-width model (width 64) consistently exhibits the poorest robustness, reflecting its sensitivity to interpolation.

Table 2 quantifies how the interpolation peak evolves with noise. The peak consistently occurs near width 64 across all synthetic noise levels, consistent with theoretical predictions that it appears at the interpolation threshold. However, its magnitude decreases systematically: the peak is 776% worse than the best model at σ = 0, 145% worse at σ = 0.5, and stabilizes around 142–145% for σ ≥ 1.0. This attenuation, from a peak roughly 8.8× the best model's error at σ = 0 to roughly 2.4× at higher noise levels, demonstrates that while the peak persists structurally, its practical impact diminishes as noise dominates signal.
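The peak statistics reported in Table 2 can be read off the width sweep directly. A sketch of the computation, assuming the results records produced by the sweep sketch above; in these experiments the global maximum of the width curve coincides with the interpolation peak.

import pandas as pd

def peak_stats(results, sigma):
    """Interpolation-peak summary for one noise level: peak width, peak MSE,
    best MSE, peak/best ratio, and percent worse than the best model."""
    df = pd.DataFrame(results)
    curve = df[df["sigma"] == sigma].groupby("width")["test_mse"].mean()
    peak_width, best_width = curve.idxmax(), curve.idxmin()
    ratio = curve[peak_width] / curve[best_width]
    return {
        "peak_width": int(peak_width),
        "peak_mse": float(curve[peak_width]),
        "best_mse": float(curve[best_width]),
        "peak_over_best": float(ratio),
        "pct_worse": 100.0 * (float(ratio) - 1.0),
    }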

Figure 3. Test MSE for simple, threshold, and complex models as label noise σ increases on synthetic data.

Figure 4 reports out-of-sample R² for the same three models. The threshold model fails first as noise increases, followed by the complex model, while the simple model remains most stable. This pattern is consistent with theoretical predictions: models near the interpolation threshold are the least robust because they transition between underparameterized and overparameterized regimes.

Figure 4. Out-of-sample R² for simple, threshold, and complex models across noise levels, showing that the threshold model fails first.

We also compare neural networks with ridge regression. Figure 5 shows that ridge degrades smoothly as noise increases, while the best neural network across widths deteriorates sharply once noise dominates the signal. At high noise levels ridge regression outperforms all neural networks, maintaining a positive R² of 0.08 at σ = 4 while the best neural network falls to R² = -0.03. This illustrates that ridge's explicit L2 penalty and linear inductive bias provide graceful degradation, whereas the flexibility of neural networks becomes a liability when signal is weak.

Figure 5. Comparison of ridge regression and neural networks across noise levels, highlighting that ridge is more robust at high noise.

These results support the second hypothesis: the interpolation peak attenuates markedly as σ increases, and its practical cost shrinks from roughly 8.8× the best model's error at σ = 0 to roughly 2.4× at σ = 4.

Experiment 3: Financial time series

Figure 6 shows test mean squared error versus model width for S&P 500 returns. The curve is almost flat across four orders of magnitude in parameter count. A small peak appears near width 64, but its magnitude is minimal—only 9 percent worse than the best model. All models achieve negative R², indicating that none outperform the naive mean predictor. This behavior matches the synthetic results at high noise levels.

Figure 6. Test MSE versus model width for S&P 500 data, showing a nearly flat curve with a very small interpolation peak.

The contrast with synthetic data is striking. Table 2 quantifies this difference: on synthetic data with σ = 0, the interpolation peak is 776% worse than the best model, while on S&P 500 data the peak is only 9% worse. This represents an 86-fold attenuation of the double descent structure. Moreover, the best neural network achieves R² = 0.988 on synthetic data but R² = -0.002 on financial data, indicating that even the optimal model configuration fails to extract any predictive signal from equity returns. Real financial data behave like a system whose effective noise level lies far beyond σ = 4, a regime in which the conditions necessary for double descent do not hold.

These results confirm the third hypothesis. When the data generating process is noisy and unstable, increasing model capacity does not improve generalization and can worsen robustness.

Extension: Could regularization rescue overparameterization?

One natural question arises from the preceding results: could explicit regularization techniques rescue overparameterization in low-signal domains? To address this, we repeated the width sweep experiments with four regularization strategies: dropout (p=0.2), weight decay (λ=0.001), early stopping on validation loss, and no regularization (baseline). We applied each method to three representative datasets: synthetic data with σ=0 (high signal), synthetic data with σ=4 (high noise), and S&P 500 returns (no signal). Figure 7 shows the results.
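A sketch of how these four configurations might be wired into the training setup described earlier; the dropout placement and the early-stopping bookkeeping are assumptions, while the dropout rate and weight-decay coefficient follow the values above.

import torch
import torch.nn as nn

def make_model(in_dim, width, dropout_p=0.0):
    """MLP as before, optionally with dropout after each hidden layer."""
    layers = [nn.Linear(in_dim, width), nn.ReLU()]
    if dropout_p > 0:
        layers.append(nn.Dropout(dropout_p))
    layers += [nn.Linear(width, width), nn.ReLU()]
    if dropout_p > 0:
        layers.append(nn.Dropout(dropout_p))
    layers += [nn.Linear(width, 1), nn.Flatten(0)]   # (N, 1) -> (N,) to match scalar targets
    return nn.Sequential(*layers)

# The four configurations compared in Figure 7.
CONFIGS = {
    "none":           dict(dropout_p=0.0, weight_decay=0.0,  early_stop=False),
    "dropout":        dict(dropout_p=0.2, weight_decay=0.0,  early_stop=False),
    "weight_decay":   dict(dropout_p=0.0, weight_decay=1e-3, early_stop=False),
    "early_stopping": dict(dropout_p=0.0, weight_decay=0.0,  early_stop=True),
}

def make_optimizer(model, cfg, lr=1e-3):
    # Weight decay enters through Adam's weight_decay argument (an L2 penalty).
    return torch.optim.Adam(model.parameters(), lr=lr, weight_decay=cfg["weight_decay"])

# When cfg["early_stop"] is True, training monitors validation loss each epoch and
# keeps the parameters from the best-validation epoch; the patience used for that
# monitor is not specified in the text.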

Figure 7. Regularization study across three datasets: weight decay improves performance on clean data, early stopping helps at high noise, and no method helps on S&P 500 returns.

On clean synthetic data (σ=0), weight decay provided the strongest performance, improving R² from 0.988 to 0.993 by imposing an explicit structural prior. Early stopping and no regularization performed identically, while dropout reduced performance to 0.947 by removing too much model capacity. This demonstrates that when signal is strong, L2 regularization can improve upon implicit regularization alone.

At high noise (σ=4), the pattern reversed dramatically. Early stopping was the only method achieving positive R² (0.056), preventing the model from overfitting to noise during training. Both weight decay and no regularization failed catastrophically, producing large negative R² values with high variance across seeds. In this regime, the problem shifted from bias to variance explosion, and only early termination of training prevented models from fitting spurious noise patterns.

On S&P 500 returns, all four methods failed to achieve positive R² and produced very similar, nearly flat test error curves. Every regularization strategy produced negative R², with the best method (weight decay at R²=-0.0005) still worse than the naive mean predictor. This demonstrates that when predictive signal is fundamentally absent, no amount of regularization can improve generalization.

Our results suggest a hierarchy of regularization effectiveness that depends on signal-to-noise ratio: weight decay excels when signal is strong, early stopping helps when noise dominates, but nothing works when signal is absent. The failure of all methods on financial data confirms that overparameterization in this domain reflects a fundamental signal problem rather than a technical regularization problem. Different regularization strategies can modify the double descent curve under appropriate conditions, but they cannot create predictive structure where none exists.

Conclusion and limitations

This work examined whether the double descent phenomenon, widely documented in high signal machine learning tasks, extends to low signal financial time series. The experiments support all three hypotheses. On clean synthetic data, multilayer perceptrons reproduce the familiar double descent curve with a sharp interpolation peak and recovery at large widths. As label noise increases, this structure weakens and eventually disappears, and smaller models become more robust than larger ones. When the same pipeline is applied to S&P 500 returns, the test error curve is nearly flat, the interpolation peak is minimal, and all models fail to outperform the naive mean predictor. These findings suggest that, in these settings, double descent requires stable, learnable structure and does not arise in domains where noise dominates the conditional distribution of returns.

The results have practical implications. In settings with extremely low signal-to-noise ratio, increasing model capacity does not reliably improve generalization. Simpler models and classical regularization therefore remain appropriate. To test whether this failure could be overcome with explicit regularization, we evaluated dropout, weight decay, and early stopping against an unregularized baseline across all noise levels. While different methods proved effective in different regimes (weight decay helped at σ = 0, early stopping at σ = 4), all four configurations failed to achieve positive R² and produced very similar, nearly flat test error curves on S&P 500 data. This provides evidence that the observed failures stem from fundamental signal absence rather than insufficient regularization.

Beyond evaluating these hypotheses, this work contributes a systematic comparison between domains where double descent is known to occur and those where it fails. By combining controlled synthetic experiments with a matched financial study under an identical modeling protocol, we isolate signal-to-noise ratio as a primary driver of the interpolation peak and rule out insufficient regularization as an alternative explanation. This provides direct empirical evidence that overparameterized deep models do not confer the same benefits in low-signal, non-stationary settings. The results position this work not only as an application study, but as a domain-informed analysis of when modern deep learning intuitions break down.

Limitations

The study focuses on one-day-ahead prediction using fixed technical features and does not evaluate alternative horizons, richer feature sets, or fundamental data. The model class is restricted to fully connected networks with fixed depth; recurrent or attention-based models may behave differently. The financial analysis is limited to U.S. equities over a single decade, and results may vary across asset classes or market regimes.

Future work

Potential extensions include evaluating sequence-based architectures in settings with stronger temporal structure, exploring feature constructions that may increase signal strength, and testing whether other asset classes exhibit different generalization behavior. While explicit regularization techniques were shown not to rescue overparameterization in financial data, other architectural modifications such as attention mechanisms, ensemble methods, or meta-learning approaches may warrant investigation. Understanding which modeling choices can extract weak signal from noisy, non-stationary environments remains an open question.


References:

[1] M. Belkin, D. Hsu, S. Ma, and S. Mandal. Reconciling modern machine learning practice and the bias–variance trade-off. Proceedings of the National Academy of Sciences, 2019.

[2] P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak, and I. Sutskever. Deep double descent: Where bigger models and more data hurt. arXiv preprint arXiv:1912.02292, 2019.

[3] T. Hastie, A. Montanari, S. Rosset, and R. Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, 2019.

[4] P. L. Bartlett, P. M. Long, G. Lugosi, and A. Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 2020.

[5] S. Mei and A. Montanari. The generalization error of random features regression: Precise asymptotics and double descent curve. Probability Theory and Related Fields, 2022.

[6] B. Adlam and J. Pennington. Understanding double descent requires a fine-grained bias–variance decomposition. Advances in Neural Information Processing Systems, 2020.

[7] V. Assandri, S. Heshmati, B. Yaman, A. Iakovlev, and A. E. Repetur. Deep double descent for time series forecasting: Avoiding undertrained models. arXiv preprint arXiv:2311.01442, 2023.

[8] E. Noguer i Alonso and A. Srivastava. The shape of performance curve in financial time series. SSRN preprint 3986154, 2021.