the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 License.
Conditional updates of neural network weights for increased out of training performance
Abstract. This study proposes a method to enhance neural network performance when training data and application data are not very similar, e.g., out of distribution problems, as well as pattern and regime shifts. The method consists of three main steps: 1) Retrain the neural network towards reasonable subsets of the training data set and note down the resulting weight anomalies. 2) Choose reasonable predictors and derive a regression between the predictors and the weight anomalies. 3) Extrapolate the weights, and thereby the neural network, to the application data. We show and discuss this method in three nonlinear use cases from the climate sciences, which include successful temporal, spatial and cross-domain extrapolations of neural networks.
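The three-step procedure described in the abstract can be sketched end to end. The following toy example is not from the manuscript: ordinary least squares stands in for NN training, and a single scalar regime parameter stands in for the predictors, purely to illustrate the retrain/regress/extrapolate logic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for "NN training": ordinary least squares on (x, y),
# so the "weights" are just the two regression coefficients. (hypothetical)
def train(x, y):
    A = np.column_stack([np.ones_like(x), x])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

# Synthetic data whose true mapping drifts with a regime predictor p:
# y = (1 + p) * x, observed in regimes p = 0.0 ... 0.4.
def make_data(p, n=200):
    x = rng.uniform(-1, 1, n)
    y = (1.0 + p) * x + 0.01 * rng.standard_normal(n)
    return x, y

# Parent model trained on the p = 0 regime.
x0, y0 = make_data(0.0)
w0 = train(x0, y0)

# Step 1: retrain on subsets (here: regimes) and record weight anomalies.
predictors = np.array([0.1, 0.2, 0.3, 0.4])
anomalies = np.array([train(*make_data(p)) - w0 for p in predictors])

# Step 2: linear regression of each weight anomaly on the predictor.
P = np.column_stack([np.ones_like(predictors), predictors])
coef, *_ = np.linalg.lstsq(P, anomalies, rcond=None)  # shape (2, n_weights)

# Step 3: extrapolate the weights to an unseen regime p_star = 0.8.
p_star = 0.8
w_star = w0 + np.array([1.0, p_star]) @ coef

# The extrapolated slope should be near the true value 1 + 0.8 = 1.8,
# while the frozen parent model keeps its slope near 1.0.
print(w0[1], w_star[1])
```

In this linear toy case the weight anomalies depend smoothly on the predictor, so the extrapolation succeeds by construction; the manuscript's claim is that the same regression step remains useful for genuine NN weights.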
Status: open (until 20 Apr 2026)
- RC1: 'Comment on egusphere-2026-728', Anonymous Referee #1, 12 Mar 2026
- RC2: 'Comment on egusphere-2026-728', Anonymous Referee #2, 25 Mar 2026
Extrapolation has long been recognized as a potentially serious problem when using neural network (NN) models. This manuscript proposes an interesting new method using mainly linear regression to improve on NN extrapolation, which may improve on earlier, simpler linear methods (e.g., <https://www.jeionline.org/index.php?journal=mys&page=article&op=view&path%5B%5D=202300493>). There are places in the manuscript where it is hard to follow, and the use of higher order polynomials in the linear regression is a poor approach that should be corrected.
Major comment:
- Sect. 4.1, paragraph 10:
Using polynomials as basis functions in a linear regression model is a bad choice of basis functions when one is extrapolating outside the training data domain. To model a nonlinear relation y = f(x) with a linear regression model, people tend to use a sum of bounded basis functions such as Gaussian or sigmoidal functions, not unbounded basis functions like polynomials, which have wild extrapolation behavior. I think this part needs to be redone with better behaving basis functions.
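The referee's concern can be illustrated with a small sketch (hypothetical data, not from the manuscript): fit the same smooth, saturating relation once with unbounded polynomial basis functions and once with bounded Gaussian basis functions, then evaluate both fits far outside the training interval.

```python
import numpy as np

# Training data on [-1, 1] from a smooth, bounded relation.
x = np.linspace(-1, 1, 50)
y = np.tanh(2 * x)

def fit_predict(design, x_fit, y_fit, x_eval):
    """Linear least-squares fit in a given basis, then evaluate."""
    A = design(x_fit)
    w, *_ = np.linalg.lstsq(A, y_fit, rcond=None)
    return design(x_eval) @ w

# Unbounded basis: polynomials up to degree 5.
poly = lambda x: np.vander(x, 6)

# Bounded basis: Gaussian bumps centred inside the training domain.
centres = np.linspace(-1, 1, 6)
gauss = lambda x: np.exp(-((x[:, None] - centres) ** 2) / (2 * 0.4 ** 2))

x_out = np.array([3.0])  # well outside the training domain
y_poly = fit_predict(poly, x, y, x_out)
y_gauss = fit_predict(gauss, x, y, x_out)

# The polynomial diverges far from the data, while the Gaussian basis
# decays back towards zero instead of blowing up.
print(y_poly[0], y_gauss[0])
```

Both models fit the training interval comparably well; the difference only appears under extrapolation, which is exactly the regime the manuscript targets.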
Minor comments:
- Sect. 2.1, paragraph 3: 1 Sv = 10^6 ...
- Table 1, step 2: What is a "reasonable subset"? 5%, 10%, 20%, 50%?
- Table 1, step 2(a): You mean: n copies of a single data point xi? Also, when choosing the xi data points, would you try to choose ones that are closer to the outer boundary of the data cloud than to its center? Points near the center presumably would not help with the extrapolation problem.
- Table 1, step 3: The term "target data" is confusing, since in standard NN terminology for (x, y) regression problems, the target data are the observed values of y. Here it seems to mean data points in the training data set?
- Sect. 4.1, paragraph 2: EOFs are now a little old-fashioned in the age of NN. For future research, one could use an autoencoder NN with the neurons in the bottleneck layer giving the nonlinear principal components (the NN can either be the multi-layer perceptron model or the CNN model). With CNN, one could use a U-Net model to extract nonlinear principal components (<https://gmd.copernicus.org/articles/13/1609/2020/>) (e.g. have 4 neurons in the bottleneck layer to give 4 nonlinear principal components).
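As a sketch of this suggestion (toy data, manual backpropagation; a real application would use a deep-learning framework and the manuscript's climate fields), a minimal numpy autoencoder whose 4-neuron bottleneck plays the role of the nonlinear principal components:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data on a 1-D nonlinear manifold embedded in 10 dimensions:
# x_j(t) = sin((j + 1) * t), a hypothetical stand-in for a climate field.
n, d, k = 200, 10, 4
t = rng.uniform(-1, 1, n)
X = np.sin(np.outer(t, np.arange(1, d + 1)))

# Autoencoder: tanh encoder to a k-neuron bottleneck, linear decoder.
W1 = 0.1 * rng.standard_normal((d, k)); b1 = np.zeros(k)
W2 = 0.1 * rng.standard_normal((k, d)); b2 = np.zeros(d)

lr, losses = 0.05, []
for _ in range(3000):
    Z = np.tanh(X @ W1 + b1)      # bottleneck activations = nonlinear PCs
    Out = Z @ W2 + b2             # reconstruction of the input
    err = Out - X
    losses.append((err ** 2).sum(axis=1).mean())
    g_out = 2 * err / n           # gradient of the reconstruction loss
    gW2, gb2 = Z.T @ g_out, g_out.sum(axis=0)
    g_pre = (g_out @ W2.T) * (1 - Z ** 2)   # backprop through tanh
    gW1, gb1 = X.T @ g_pre, g_pre.sum(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

# The reconstruction error drops during training; Z has shape (200, 4),
# one 4-dimensional nonlinear-PC vector per sample.
print(losses[0], losses[-1], Z.shape)
```

The columns of Z are the analogue of the leading EOF coefficients, but obtained through a nonlinear mapping rather than a linear projection.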
- Sect. 4.1, paragraph 10: "regressions of polynomials of order 1 and order 2 give comparable results" does not look right. I think order 2 is noticeably worse than order 1, e.g. in the box plots (Fig. 2), the bottom of the blue boxes went below the zero line for order 2 but not for order 1.
- Sect. 4.3, paragraph 2: ELU activation functions: Are the ELU functions used in both layers?
- Sect. 4.3, final paragraph: I don't see anything contradictory. A single NN model is developed to do interpolation, not extrapolation. You then develop a second NN to help with extrapolation.
Citation: https://doi.org/10.5194/egusphere-2026-728-RC2
- RC3: 'Comment on egusphere-2026-728', Anonymous Referee #3, 07 Apr 2026
The manuscript, “Conditional updates of neural network weights for increased out of training performance,” presents a novel methodology aimed at improving neural network (NN) performance when the training data are not representative of the distribution encountered at inference time. This includes both out-of-distribution scenarios and shifts between application regimes. The method is evaluated in three different settings, where the adapted models are reported to outperform the corresponding baseline model.
General comment:
For all three applications, the proposed method is compared only against a “naive” baseline model. While the authors demonstrate improvements relative to this baseline, the absence of comparisons with alternative approaches developed for related out-of-distribution or regime-shift problems makes it difficult to assess the practical significance of the reported gains. The challenge of degraded deep-learning performance under distribution shift has been extensively studied, yet the manuscript does not provide any comparison with existing methods from that literature. Although the proposed framework may be more general than many previously published approaches, the current results lack a meaningful frame of reference beyond a standard neural network. I therefore encourage the authors to include, for each application, at least one additional benchmark method, training strategy, or architecture from the relevant literature. This would allow the reader to assess whether the proposed approach offers advantages in predictive skill, computational cost, robustness, or generality relative to existing alternatives. If the authors do not wish to benchmark against a broad range of methods, they should at least justify why the chosen baseline is sufficient for each application and clarify what specific advantage their method is intended to provide over existing approaches.
Minor comments:
- The proposed approach involves several methodological choices, including, for example, the selection of weight-prediction and weight-regression strategies. A summary table for each experiment, listing the specific design choices adopted, would help the reader follow the setup and focus more clearly on the results.
- The first paragraph of Section 1 is well supported by citations. However, much of the remainder of the introduction contains statements that would require citations but are not supported by any.
- Some acronyms are introduced without definition, for example AI and ARGO. All acronyms should be defined at first use.
- There are some inconsistencies in acronym formatting and capitalization. For example, neural networks is not capitalized when the acronym is introduced in Section 1, whereas Equation of State is capitalized when introduced in Section 2.2. The authors should check the manuscript for consistency in this regard.
- In Section 2, I would recommend presenting the methodology before the individual experiments. This would allow the reader to understand the general framework before encountering the application-specific implementations.
- In Section 2.3, the terminology may need to be reconsidered. I am not convinced that cross-domain is the most appropriate description of Experiment 3. Based on both the setup and the later discussion, this case seems more naturally characterized as spatiotemporal extrapolation.
- There are a number of grammatical issues and typographical errors throughout the draft. For example, phrases such as “the training bases only on data” and “Polynomials of order 1 and 2 where fitted” should be corrected.
- I would encourage the authors to provide access to an implementation of the proposed approach, for example through a public code repository. This would greatly facilitate reproducibility and allow readers to test and use the method more easily.
Citation: https://doi.org/10.5194/egusphere-2026-728-RC3
Viewed
Since the preprint corresponding to this journal article was posted outside of Copernicus Publications, the preprint-related metrics are limited to HTML views.
| HTML | PDF | XML | Total | BibTeX | EndNote |
|---|---|---|---|---|---|
| 145 | 0 | 1 | 146 | 0 | 0 |
This manuscript describes an approach to improve neural network (NN) predictions in settings where the data on which the NN is trained differ (substantially) from those in the setup to which the NN will ultimately be applied. This situation is frequently encountered in earth sciences (and presumably other areas of application), and thus this research topic is highly relevant to applications.
General comment:
The proposed approach to address this issue is more like a general procedure than a specific method, with many modeling choices being left to the user depending on the setup in which this approach is applied. I understand that this is inevitable, but I would then expect a solid theoretical motivation of the proposed approach that gives a good understanding of when and why it works. I feel that this is still lacking in the current version of the manuscript.

While the three use cases are good examples to demonstrate that the proposed approach can yield notable improvements while also discussing shortcomings and open questions, I must admit that I am somewhat puzzled that the proposed procedure works at all. In the introduction, the authors motivate this research by explaining that in non-linear problems, a systematic shift in the input data does not easily translate to a corresponding shift in the output. But isn't this also true for the NN weights? In my understanding, NN weights are typically not even identifiable, i.e. completely different weights can yield the same (or at least very similar) output. So, how is it possible that these weights can be extrapolated by (e.g. linear) methods in a stable way? Intuitively, I would expect this to work only if the weights of ParentModeli are guaranteed to remain relatively close to those of ParentModel0, but it doesn't seem that the authors tried to control this in any way.

Please provide more motivation and explanation (which could, e.g., include a more detailed analysis and visualization of the weight extrapolation underlying the different curves in Fig. 1) of why you expect this approach to work in general, or which conditions have to be satisfied for this weight extrapolation to work. If any tuning parameters were involved in getting the positive results shown in the three use cases, it would be useful to share this experience to help guide the application of this approach in different settings.
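The non-identifiability the referee alludes to can be demonstrated directly in a toy example (not from the manuscript): permuting the hidden units of a one-hidden-layer network produces a markedly different weight vector that computes exactly the same function.

```python
import numpy as np

rng = np.random.default_rng(3)

# A tiny one-hidden-layer network: f(x) = W2 @ tanh(W1 @ x + b1) + b2.
W1, b1 = rng.standard_normal((5, 3)), rng.standard_normal(5)
W2, b2 = rng.standard_normal((1, 5)), rng.standard_normal(1)

def forward(x, W1, b1, W2, b2):
    return W2 @ np.tanh(W1 @ x + b1) + b2

# Permute hidden units 0 and 1: swap rows of (W1, b1) and columns of W2.
perm = np.array([1, 0, 2, 3, 4])
W1p, b1p, W2p = W1[perm], b1[perm], W2[:, perm]

x = rng.standard_normal(3)
y = forward(x, W1, b1, W2, b2)
yp = forward(x, W1p, b1p, W2p, b2)

# Identical outputs, yet the weight matrices differ substantially,
# so naive regression in weight space is not obviously well posed.
print(np.allclose(y, yp), np.linalg.norm(W1 - W1p))
```

This symmetry is one concrete reason why retrained weights might wander far from the parent model's weights without any change in the represented function, which is the heart of the referee's question.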
Specific comments:
4.1, 2nd paragraph, 'Note, however, that the sensitivity ...': Can you explain this statement further, i.e., in what way do the activation functions affect the sensitivity of the weights? More generally, this entire paragraph seems to assume good familiarity with this type of problem and is somewhat short on details.
4.1, 3rd paragraph, 'As there is a clear order ...': This sounds like a substantial departure from the proposed approach. Doesn't this imply that, instead of anomalies from ParentModel0, the weight regression is applied to the increments corresponding to subsequent time points? Please clarify.
4.1, 3rd paragraph, '... every weight and bias of the CNN ...': So, including the weights of the convolution layers? It is again surprising and somewhat counter-intuitive to me that weights of a convolution layer can be (linearly) extrapolated in a meaningful way. Is there any way to visualize the evolution of just the first convolution weights over time, and the associated extrapolation to after the tipping event? Is it possible to quantify in which layer the most impactful extrapolation happens, that improves the NNs performance during and at the tipping event?
4.2, 1st paragraph, '... only individual xi': Can you explain this further? That is, (a) clarify what you mean by individual xi, (b) explain what you mean by regularization towards the entire training data set (I believe this has never been explained in detail), and (c) explain why you chose a different approach here.