Benchmark

This section documents the experimental results comparing three different U-Net architectures for downscaling within the IPSL-AID framework.

Model Performance Comparison

This section presents a comparison of three U-Net architectures trained for statistical downscaling of atmospheric variables. All models were trained with identical hyperparameters.

Experiment Configuration

All models were trained with the following common configuration:

Dataset: ERA5 reanalysis data (2015-2019 training, 2020 validation)
Domain: Global
Input resolution: 721 × 1440 (0.25° grid)
Input channels: 10 (6 target variables + 4 constants)
Output channels: 6 (downscaled meteorological variables)
Target variables:
- VAR_2T: 2-meter temperature (K)
- VAR_10U: 10-meter U wind component (m/s)
- VAR_10V: 10-meter V wind component (m/s)
- VAR_TP: Total precipitation (m/h)
- VAR_D2M: 2-meter dewpoint temperature (K)
- VAR_ST: Skin temperature (K)
Normalization: Standard scaling (log1p for precipitation)
Time encoding: Sine/cosine of day-of-year (4 channels)
Constant variables: Orography (z) and land-sea mask (lsm)
Loss function: UNet diffusion loss (MSE-based)
Learning rate: 0.0001
Batch size: 36 (12 spatial × 1460 temporal)
Epochs: 20
Spatial batching: 12 tiles
Temporal batching: 1460 time steps
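The normalization entry above (standard scaling, with log1p for precipitation) can be illustrated with a minimal sketch. The function names `normalize` and `denormalize` and the toy values are hypothetical and not taken from the IPSL-AID code. Computing the mean and standard deviation after the log transform is also an assumption.

```python
import numpy as np

def normalize(x, mean, std, log_transform=False):
    """Standard-scale a field; optionally apply log1p first (e.g. precipitation)."""
    x = np.asarray(x, dtype=np.float64)
    if log_transform:
        x = np.log1p(x)  # compresses the long right tail and maps 0 to 0
    return (x - mean) / std

def denormalize(z, mean, std, log_transform=False):
    """Invert normalize(): undo the scaling, then undo log1p with expm1."""
    x = np.asarray(z, dtype=np.float64) * std + mean
    if log_transform:
        x = np.expm1(x)
    return x

# Toy precipitation values (m/h), mostly zeros; illustrative only.
tp = np.array([0.0, 0.0, 0.0, 1e-4, 2e-3])

# Assumption: the scaling statistics are computed in log space.
log_tp = np.log1p(tp)
mean, std = log_tp.mean(), log_tp.std()

z = normalize(tp, mean, std, log_transform=True)
tp_back = denormalize(z, mean, std, log_transform=True)
```

The round trip through `denormalize` recovers the original values, which is what allows metrics to be reported either on the normalized scale or in physical units.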

Model Architectures

Three U-Net variants were evaluated:

DDPM++ (SongUNet - Positional embedding)
- Denoising Diffusion Probabilistic Model architecture
- Positional timestep embedding
- Standard encoder/decoder with skip connections
- Channel multiplier: [2, 2, 2]
- Base channels: 128
- Resampling filter: [1, 1]
- Parameters: 54,429,958

NCSN++ (SongUNet - Fourier embedding)
- Noise-Conditioned Score Network architecture
- Fourier feature timestep embedding
- Residual encoder with skip connections
- Channel multiplier: [2, 2, 2]
- Base channels: 128
- Resampling filter: [1, 3, 3, 1]
- Parameters: 55,109,510

ADM (DhariwalUNet)
- Ablated Diffusion Model architecture
- Multi-resolution attention (32, 16, 8)
- Channel multiplier: [1, 2, 3, 4]
- Base channels: 128
- Number of blocks: 2
- Parameters: 92,140,550

Performance Metrics

The following metrics were used for evaluation on the validation set (year 2020):

Loss: UNet diffusion loss value
MAE: Mean Absolute Error (normalized scale)
NMAE: Normalized Mean Absolute Error (normalized by variable range)
RMSE: Root Mean Square Error (normalized scale)
R²: Coefficient of determination
Pearson: Pearson correlation coefficient
KL: Kullback-Leibler divergence between the predicted and target distributions (lower means more similar)
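As a rough guide to how these quantities relate, the sketch below computes them with NumPy for a single variable. It is not the framework's evaluation code. The function name `evaluate`, the histogram-based KL estimate, the direction KL(target || prediction), and the use of the target range for NMAE are all assumptions made for illustration.

```python
import numpy as np

def evaluate(pred, target, bins=50, eps=1e-10):
    """Return the benchmark metrics for one variable (arrays of equal shape)."""
    pred = np.asarray(pred, dtype=np.float64).ravel()
    target = np.asarray(target, dtype=np.float64).ravel()
    err = pred - target

    mae = np.abs(err).mean()
    rmse = np.sqrt(np.mean(err ** 2))
    nmae = mae / (target.max() - target.min())  # assumption: range of the target
    r2 = 1.0 - np.sum(err ** 2) / np.sum((target - target.mean()) ** 2)
    pearson = np.corrcoef(pred, target)[0, 1]

    # Assumption: KL(target || pred) estimated from histograms on shared bins.
    lo = min(pred.min(), target.min())
    hi = max(pred.max(), target.max())
    p = np.histogram(target, bins=bins, range=(lo, hi))[0] + eps
    q = np.histogram(pred, bins=bins, range=(lo, hi))[0] + eps
    p, q = p / p.sum(), q / q.sum()
    kl = float(np.sum(p * np.log(p / q)))

    return {"MAE": mae, "NMAE": nmae, "RMSE": rmse,
            "R2": r2, "Pearson": pearson, "KL": kl}

# Toy example: a prediction with a constant +0.5 bias.
target = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
pred = target + 0.5
metrics = evaluate(pred, target)
```

In this toy case the Pearson correlation is 1 even though R² is only 0.875, because correlation ignores a constant bias while R² penalizes it. This is one reason both are reported.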

Quantitative Comparison - Overall Metrics

Performance comparison for all variables (validation set)
Architecture	Loss ↓	MAE ↓	NMAE ↓	RMSE ↓	R² ↑	Pearson ↑
DDPM++ (SongUNet)	0.0524	0.3458 ± 0.0069	0.1170 ± 0.0035	0.5604 ± 0.0164	0.9482 ± 0.0025	0.9722 ± 0.0014
NCSN++ (SongUNet)	0.0517	0.3432 ± 0.0068	0.1176 ± 0.0037	0.5552 ± 0.0159	0.9489 ± 0.0027	0.9725 ± 0.0016
ADM (DhariwalUNet)	0.0527	0.3500 ± 0.0071	0.1179 ± 0.0035	0.5656 ± 0.0166	0.9480 ± 0.0023	0.9721 ± 0.0013

Note: ↓ indicates lower is better, ↑ indicates higher is better. Values show mean ± std across spatial batches.
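A minimal sketch of how a "mean ± std" table entry could be formed from per-batch values follows. The helper names and the per-tile numbers are made up, and the use of the population standard deviation (NumPy's default, `ddof=0`) is an assumption, since the source does not state which estimator was used.

```python
import numpy as np

def summarize(per_batch_values):
    """Mean and (population) standard deviation across spatial batches."""
    values = np.asarray(per_batch_values, dtype=np.float64)
    return values.mean(), values.std()

def format_entry(mean, std, digits=4):
    """Format a table cell as 'mean ± std'."""
    return f"{mean:.{digits}f} ± {std:.{digits}f}"

# Made-up MAE values for four spatial tiles; illustrative only.
per_tile_mae = [0.34, 0.35, 0.36, 0.35]
mean, std = summarize(per_tile_mae)
entry = format_entry(mean, std)
```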

Baseline Comparison

For reference, metrics for the coarse input (bilinear interpolation of the low-resolution input) are provided:

Baseline coarse input performance (all variables)
Baseline	MAE	NMAE	RMSE	R²	Pearson
Coarse Input	0.6993 ± 0.0214	0.1981 ± 0.0045	1.2208 ± 0.0368	0.8873 ± 0.0039	0.9377 ± 0.0028

Improvement over baseline: All three U-Net architectures improve substantially on the coarse input, reducing MAE by approximately 50% (from 0.699 to between 0.343 and 0.350) and increasing R² from 0.887 to 0.948 or higher.
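The relative improvements can be recomputed from the two tables above. The short sketch below does that arithmetic; the dictionary layout and the helper `relative_reduction` are illustrative only, while the numbers are copied from the tables.

```python
# Values copied from the overall and baseline tables (normalized scale).
baseline = {"MAE": 0.6993, "RMSE": 1.2208, "R2": 0.8873}
models = {
    "DDPM++": {"MAE": 0.3458, "RMSE": 0.5604, "R2": 0.9482},
    "NCSN++": {"MAE": 0.3432, "RMSE": 0.5552, "R2": 0.9489},
    "ADM":    {"MAE": 0.3500, "RMSE": 0.5656, "R2": 0.9480},
}

def relative_reduction(baseline_value, model_value):
    """Percentage by which the model lowers an error metric versus the baseline."""
    return 100.0 * (baseline_value - model_value) / baseline_value

mae_reduction = {
    name: relative_reduction(baseline["MAE"], m["MAE"]) for name, m in models.items()
}
rmse_reduction = {
    name: relative_reduction(baseline["RMSE"], m["RMSE"]) for name, m in models.items()
}
r2_gain = {name: m["R2"] - baseline["R2"] for name, m in models.items()}

for name in models:
    print(f"{name}: MAE -{mae_reduction[name]:.1f}%, "
          f"RMSE -{rmse_reduction[name]:.1f}%, R2 +{r2_gain[name]:.4f}")
```

With these inputs the MAE reduction falls between roughly 49.9% and 50.9% for the three architectures, and the RMSE reduction is around 54%.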

Per-Variable Performance

VAR_2T (2-meter Temperature)

VAR_2T performance comparison
Architecture	MAE ↓	RMSE ↓	R² ↑	Pearson ↑	KL ↓
DDPM++	0.3697 ± 0.0095	0.5873 ± 0.0171	0.9992 ± 0.0001	0.9996 ± 0.0001	0.0010
NCSN++	0.3684 ± 0.0097	0.5816 ± 0.0168	0.9992 ± 0.0001	0.9996 ± 0.0001	0.0011
ADM	0.3775 ± 0.0103	0.5968 ± 0.0176	0.9992 ± 0.0001	0.9996 ± 0.0001	0.0007

VAR_10U (10-meter U Wind)

VAR_10U performance comparison
Architecture	MAE ↓	RMSE ↓	R² ↑	Pearson ↑	KL ↓
DDPM++	0.3938 ± 0.0071	0.5921 ± 0.0237	0.9886 ± 0.0014	0.9943 ± 0.0007	0.0006
NCSN++	0.3905 ± 0.0069	0.5867 ± 0.0228	0.9888 ± 0.0013	0.9944 ± 0.0007	0.0005
ADM	0.3966 ± 0.0071	0.5960 ± 0.0227	0.9885 ± 0.0014	0.9942 ± 0.0007	0.0005

VAR_10V (10-meter V Wind)

VAR_10V performance comparison
Architecture	MAE ↓	RMSE ↓	R² ↑	Pearson ↑	KL ↓
DDPM++	0.3813 ± 0.0068	0.5701 ± 0.0240	0.9859 ± 0.0016	0.9929 ± 0.0008	0.0006
NCSN++	0.3792 ± 0.0068	0.5663 ± 0.0237	0.9861 ± 0.0016	0.9930 ± 0.0008	0.0005
ADM	0.3844 ± 0.0066	0.5739 ± 0.0238	0.9857 ± 0.0016	0.9928 ± 0.0008	0.0005

VAR_TP (Total Precipitation)

VAR_TP performance comparison (most challenging variable)
Architecture	MAE ↓	NMAE ↓	R² ↑	Pearson ↑	KL ↓
DDPM++	0.0001	0.5058 ± 0.0145	0.7182 ± 0.0133	0.8478 ± 0.0079	0.1427
NCSN++	0.0001	0.5105 ± 0.0155	0.7220 ± 0.0148	0.8497 ± 0.0088	0.1326
ADM	0.0001	0.5096 ± 0.0145	0.7176 ± 0.0133	0.8474 ± 0.0079	0.2088

Note: MAE values are on the normalized scale; precipitation shows the lowest absolute error because of the large number of zero values in the dataset.

VAR_D2M (2-meter Dewpoint Temperature)

VAR_D2M performance comparison
Architecture	MAE ↓	RMSE ↓	R² ↑	Pearson ↑	KL ↓
DDPM++	0.4515 ± 0.0117	0.7178 ± 0.0221	0.9987 ± 0.0003	0.9994 ± 0.0002	0.0016
NCSN++	0.4450 ± 0.0115	0.7104 ± 0.0222	0.9988 ± 0.0003	0.9994 ± 0.0002	0.0015
ADM	0.4551 ± 0.0117	0.7261 ± 0.0215	0.9987 ± 0.0003	0.9994 ± 0.0002	0.0017

VAR_ST (Skin Temperature)

VAR_ST performance comparison
Architecture	MAE ↓	RMSE ↓	R² ↑	Pearson ↑	KL ↓
DDPM++	0.4785 ± 0.0214	0.8947 ± 0.0485	0.9983 ± 0.0004	0.9991 ± 0.0002	0.0101
NCSN++	0.4759 ± 0.0216	0.8862 ± 0.0485	0.9983 ± 0.0004	0.9991 ± 0.0002	0.0184
ADM	0.4859 ± 0.0215	0.9004 ± 0.0501	0.9982 ± 0.0004	0.9991 ± 0.0002	0.0116

Model Complexity Comparison

Model complexity and efficiency
Architecture	Parameters	Relative Size	Inference Characteristics
DDPM++ (SongUNet)	54.4M	1.0×	Lightweight, fast inference
NCSN++ (SongUNet)	55.1M	1.01×	Slightly larger, Fourier embeddings
ADM (DhariwalUNet)	92.1M	1.69×	Larger model, multi-resolution attention
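The "Relative Size" column follows from the exact parameter counts listed under Model Architectures, taking DDPM++ as the reference. The sketch below reproduces the arithmetic; the variable names are illustrative.

```python
# Exact parameter counts from the Model Architectures section.
param_counts = {
    "DDPM++": 54_429_958,
    "NCSN++": 55_109_510,
    "ADM": 92_140_550,
}

reference = param_counts["DDPM++"]  # smallest model used as the 1.0x reference

# Size in millions of parameters, rounded as in the table.
size_millions = {name: round(n / 1e6, 1) for name, n in param_counts.items()}

# Size relative to DDPM++.
relative_size = {name: round(n / reference, 2) for name, n in param_counts.items()}

for name in param_counts:
    print(f"{name}: {size_millions[name]}M, {relative_size[name]}x")
```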

Key Findings

Best Overall Performance: The NCSN++ architecture achieves the best value on five of the six overall metrics, by small margins (DDPM++ has the lowest NMAE, 0.1170 vs 0.1176):

Lowest loss (0.0517 vs 0.0524 for DDPM++ and 0.0527 for ADM)
Lowest MAE (0.3432 vs 0.3458/0.3500)
Lowest RMSE (0.5552 vs 0.5604/0.5656)
Highest R² (0.9489 vs 0.9482/0.9480)
Highest Pearson correlation (0.9725 vs 0.9722/0.9721)
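The ranking can be checked mechanically against the overall metrics table. The sketch below picks the best architecture per metric using the mean values only and ignores the reported spread. The helper `best_per_metric` is hypothetical; the numbers are copied from the table.

```python
# Mean values copied from the overall metrics table (validation set).
overall = {
    "DDPM++": {"Loss": 0.0524, "MAE": 0.3458, "NMAE": 0.1170,
               "RMSE": 0.5604, "R2": 0.9482, "Pearson": 0.9722},
    "NCSN++": {"Loss": 0.0517, "MAE": 0.3432, "NMAE": 0.1176,
               "RMSE": 0.5552, "R2": 0.9489, "Pearson": 0.9725},
    "ADM":    {"Loss": 0.0527, "MAE": 0.3500, "NMAE": 0.1179,
               "RMSE": 0.5656, "R2": 0.9480, "Pearson": 0.9721},
}

HIGHER_IS_BETTER = {"R2", "Pearson"}  # all other metrics are errors or losses

def best_per_metric(table):
    """Map each metric to the architecture with the best mean value."""
    metrics = next(iter(table.values())).keys()
    best = {}
    for metric in metrics:
        pick = max if metric in HIGHER_IS_BETTER else min
        best[metric] = pick(table, key=lambda name: table[name][metric])
    return best

best = best_per_metric(overall)
wins = {name: sum(1 for b in best.values() if b == name) for name in overall}
```

On these values NCSN++ leads on every metric except NMAE, where DDPM++ is marginally lower.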

Best Performance for Precipitation: NCSN++ achieves the highest R² for VAR_TP (0.7220) and lowest KL divergence (0.1326), indicating better distribution matching.

Best Performance for Wind Fields: NCSN++ achieves the best MAE, RMSE, R², and Pearson correlation for both U and V wind components, and ties with ADM for the lowest KL divergence (0.0005).

Most Challenging Variable: Precipitation (VAR_TP) shows the lowest R² scores (0.718-0.722) and highest NMAE (0.506-0.511), reflecting the difficulty of downscaling intermittent precipitation events.

Model Efficiency: The DDPM++ architecture has the fewest parameters (54.4M) while maintaining competitive performance, making it suitable for resource-constrained applications.

Wind Field Anisotropy: Performance is slightly better for U-wind (R² ~0.9888) than V-wind (R² ~0.9861), which may reflect the zonal dominance of atmospheric circulation.

Recommendations

Based on the comprehensive comparison across 6 meteorological variables:

For maximum accuracy: Use NCSN++ (SongUNet with Fourier embeddings)
- Best overall performance across nearly all metrics
- Superior handling of precipitation distributions
- Marginal parameter increase over DDPM++
For balanced performance: Use DDPM++ (SongUNet with positional embeddings)
- Excellent performance with slightly fewer parameters
- Competitive across all variables
- Best for resource-constrained deployment
For temperature-sensitive applications: All three models perform excellently (R² > 0.999 for VAR_2T and R² > 0.998 for VAR_D2M and VAR_ST), with minimal differences
For precipitation downscaling: NCSN++ is the recommended choice due to superior distribution matching and higher R²
For ensemble applications: Consider all three as they show complementary strengths across different variable types
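For the ensemble option, one simple way to combine the three models is an unweighted mean of their predictions, with the member spread as a rough disagreement indicator. This is only a sketch of that idea under stated assumptions: the function `ensemble_mean`, the array shapes, and the synthetic predictions are hypothetical, and the benchmark itself did not evaluate an ensemble.

```python
import numpy as np

def ensemble_mean(predictions):
    """Unweighted mean and member spread over a dict of same-shaped predictions."""
    stacked = np.stack(list(predictions.values()), axis=0)  # (members, ...)
    return stacked.mean(axis=0), stacked.std(axis=0)

# Synthetic stand-ins for model outputs: (variables, lat, lon) on a toy grid.
rng = np.random.default_rng(0)
truth = rng.normal(size=(6, 8, 16))
predictions = {
    name: truth + 0.1 * rng.normal(size=truth.shape)
    for name in ("DDPM++", "NCSN++", "ADM")
}

mean, spread = ensemble_mean(predictions)
```

A weighted combination, for example with weights chosen per variable from validation scores, would be a natural extension, but the weights would need to be fitted on held-out data.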

Note on ADM Performance

While the ADM architecture achieves competitive performance, it underperforms both SongUNet variants on the overall metrics despite having about 1.7× as many parameters (92.1M vs 54.4M for DDPM++). This suggests that:

The SongUNet architecture is better suited for downscaling tasks
The simplified U-Net design with fewer attention layers may generalize better
The additional complexity of ADM does not translate to improved performance for this application