Benchmark
This section documents the experimental results comparing three different U-Net architectures for downscaling within the IPSL-AID framework.
Model Performance Comparison
This comparison covers three U-Net architectures trained for statistical downscaling of atmospheric variables. All models were trained with identical hyperparameters.
Experiment Configuration
All models were trained with the following common configuration:
Dataset: ERA5 reanalysis data (2015-2019 training, 2020 validation)
Domain: Global
Input resolution: 721 × 1440 (0.25° grid)
Input channels: 10 (6 target variables + 4 constants)
Output channels: 6 (downscaled meteorological variables)
Target variables:
VAR_2T: 2-meter temperature (K)
VAR_10U: 10-meter U wind component (m/s)
VAR_10V: 10-meter V wind component (m/s)
VAR_TP: Total precipitation (m/h)
VAR_D2M: 2-meter dewpoint temperature (K)
VAR_ST: Skin temperature (K)
Normalization: Standard scaling (log1p for precipitation)
Time encoding: Sine/cosine of day-of-year (4 channels)
Constant variables: Orography (z) and land-sea mask (lsm)
Loss function: UNet diffusion loss (MSE-based)
Learning rate: 0.0001
Batch size: 36 (12 spatial × 1460 temporal)
Epochs: 20
Spatial batching: 12 tiles
Temporal batching: 1460 time steps
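The normalization entry above (standard scaling, with log1p for precipitation) can be illustrated with a minimal sketch. It assumes that log1p is applied before the mean and standard deviation are computed and that the statistics come from the transformed values; the source does not state this order. The function names and sample values are hypothetical and are not part of IPSL-AID.

```python
import math


def fit_standard_scaler(values):
    """Return (mean, std) of a sequence, using the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, math.sqrt(var)


def normalize(values, mean, std, use_log1p=False):
    """Optionally apply log1p, then standard scaling (assumed order)."""
    if use_log1p:
        values = [math.log1p(v) for v in values]
    return [(v - mean) / std for v in values]


def denormalize(values, mean, std, use_log1p=False):
    """Invert normalize(): undo the scaling, then undo log1p with expm1."""
    out = [v * std + mean for v in values]
    if use_log1p:
        out = [math.expm1(v) for v in out]
    return out


# Hypothetical precipitation samples, mostly zeros as is typical for this variable.
tp = [0.0, 0.0, 0.0002, 0.0015, 0.0]

# Statistics are fitted on the log1p-transformed values (an assumption).
tp_mean, tp_std = fit_standard_scaler([math.log1p(v) for v in tp])

tp_norm = normalize(tp, tp_mean, tp_std, use_log1p=True)
tp_back = denormalize(tp_norm, tp_mean, tp_std, use_log1p=True)
```

Keeping the inverse transform next to the forward one makes it easier to report metrics in either normalized or physical units.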
Model Architectures
Three U-Net variants were evaluated:
DDPM++ (SongUNet - Positional embedding)
Denoising Diffusion Probabilistic Model architecture
Positional timestep embedding
Standard encoder/decoder with skip connections
Channel multiplier: [2, 2, 2]
Base channels: 128
Resampling filter: [1, 1]
Parameters: 54,429,958
NCSN++ (SongUNet - Fourier embedding)
Noise-Conditioned Score Network architecture
Fourier feature timestep embedding
Residual encoder with skip connections
Channel multiplier: [2, 2, 2]
Base channels: 128
Resampling filter: [1, 3, 3, 1]
Parameters: 55,109,510
ADM (DhariwalUNet)
Ablated Diffusion Model architecture
Multi-resolution attention (32, 16, 8)
Channel multiplier: [1, 2, 3, 4]
Base channels: 128
Number of blocks: 2
Parameters: 92,140,550
Performance Metrics
The following metrics were used for evaluation on the validation set (year 2020):
Loss: UNet diffusion loss value
MAE: Mean Absolute Error (normalized scale)
NMAE: Normalized Mean Absolute Error (normalized by variable range)
RMSE: Root Mean Square Error (normalized scale)
R²: Coefficient of determination
Pearson: Pearson correlation coefficient
KL: KL divergence (distribution similarity)
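The metrics listed above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the evaluation code used for the benchmark: NMAE is assumed to divide MAE by the range of the reference values, and the KL divergence is assumed to be a histogram-based estimate with a hypothetical bin count and smoothing constant. All function names are illustrative.

```python
import math


def mae(y_pred, y_true):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / len(y_true)


def nmae(y_pred, y_true):
    """MAE divided by the range of the reference values (assumed definition)."""
    return mae(y_pred, y_true) / (max(y_true) - min(y_true))


def rmse(y_pred, y_true):
    """Root mean square error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true))


def r2_score(y_pred, y_true):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for p, t in zip(y_pred, y_true))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def pearson(y_pred, y_true):
    """Pearson correlation coefficient."""
    mean_pred = sum(y_pred) / len(y_pred)
    mean_true = sum(y_true) / len(y_true)
    cov = sum((p - mean_pred) * (t - mean_true) for p, t in zip(y_pred, y_true))
    var_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    var_true = sum((t - mean_true) ** 2 for t in y_true)
    return cov / math.sqrt(var_pred * var_true)


def kl_divergence(y_pred, y_true, bins=10, eps=1e-10):
    """Histogram estimate of KL(true || pred) on shared bins (assumed estimator)."""
    lo = min(min(y_pred), min(y_true))
    hi = max(max(y_pred), max(y_true))
    if hi == lo:
        return 0.0  # both samples are constant and identical
    width = (hi - lo) / bins

    def histogram(values):
        counts = [0] * bins
        for v in values:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        return [c / len(values) for c in counts]

    p = histogram(y_true)
    q = histogram(y_pred)
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


# Made-up values for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
```

Because MAE and RMSE are reported on the normalized scale, the inputs to these functions would be normalized fields rather than values in physical units.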
Quantitative Comparison - Overall Metrics
| Architecture | Loss ↓ | MAE ↓ | NMAE ↓ | RMSE ↓ | R² ↑ | Pearson ↑ |
|---|---|---|---|---|---|---|
| DDPM++ (SongUNet) | 0.0524 | 0.3458 ± 0.0069 | 0.1170 ± 0.0035 | 0.5604 ± 0.0164 | 0.9482 ± 0.0025 | 0.9722 ± 0.0014 |
| NCSN++ (SongUNet) | 0.0517 | 0.3432 ± 0.0068 | 0.1176 ± 0.0037 | 0.5552 ± 0.0159 | 0.9489 ± 0.0027 | 0.9725 ± 0.0016 |
| ADM (DhariwalUNet) | 0.0527 | 0.3500 ± 0.0071 | 0.1179 ± 0.0035 | 0.5656 ± 0.0166 | 0.9480 ± 0.0023 | 0.9721 ± 0.0013 |
Note: ↓ indicates lower is better, ↑ indicates higher is better. Values show mean ± std across spatial batches.
Baseline Comparison
For reference, metrics for the coarse input (bilinear interpolation of the low-resolution input) are provided:
| Baseline | MAE | NMAE | RMSE | R² | Pearson |
|---|---|---|---|---|---|
| Coarse Input | 0.6993 ± 0.0214 | 0.1981 ± 0.0045 | 1.2208 ± 0.0368 | 0.8873 ± 0.0039 | 0.9377 ± 0.0028 |
Improvement over baseline: All three U-Net architectures improve substantially on the bilinear baseline, reducing MAE by approximately 50% (from 0.699 to 0.343-0.350) and increasing R² from 0.887 to at least 0.948.
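The improvement over the baseline can be checked with a few lines of arithmetic. The sketch below uses the mean MAE and RMSE values from the tables above; the helper name `relative_reduction` is illustrative and the calculation ignores the reported standard deviations.

```python
# Mean values copied from the overall-metrics and baseline tables.
baseline = {"MAE": 0.6993, "RMSE": 1.2208}
models = {
    "DDPM++": {"MAE": 0.3458, "RMSE": 0.5604},
    "NCSN++": {"MAE": 0.3432, "RMSE": 0.5552},
    "ADM": {"MAE": 0.3500, "RMSE": 0.5656},
}


def relative_reduction(baseline_value, model_value):
    """Percentage by which model_value is lower than baseline_value."""
    return 100.0 * (baseline_value - model_value) / baseline_value


mae_reduction = {
    name: relative_reduction(baseline["MAE"], scores["MAE"])
    for name, scores in models.items()
}
rmse_reduction = {
    name: relative_reduction(baseline["RMSE"], scores["RMSE"])
    for name, scores in models.items()
}

for name in models:
    print(f"{name}: MAE -{mae_reduction[name]:.1f}%, RMSE -{rmse_reduction[name]:.1f}%")
```

With these inputs, the MAE reduction is close to 50% for every architecture, and the RMSE reduction is somewhat larger.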
Per-Variable Performance
VAR_2T (2-meter Temperature)
| Architecture | MAE ↓ | RMSE ↓ | R² ↑ | Pearson ↑ | KL ↓ |
|---|---|---|---|---|---|
| DDPM++ | 0.3697 ± 0.0095 | 0.5873 ± 0.0171 | 0.9992 ± 0.0001 | 0.9996 ± 0.0001 | 0.0010 |
| NCSN++ | 0.3684 ± 0.0097 | 0.5816 ± 0.0168 | 0.9992 ± 0.0001 | 0.9996 ± 0.0001 | 0.0011 |
| ADM | 0.3775 ± 0.0103 | 0.5968 ± 0.0176 | 0.9992 ± 0.0001 | 0.9996 ± 0.0001 | 0.0007 |
VAR_10U (10-meter U Wind)
| Architecture | MAE ↓ | RMSE ↓ | R² ↑ | Pearson ↑ | KL ↓ |
|---|---|---|---|---|---|
| DDPM++ | 0.3938 ± 0.0071 | 0.5921 ± 0.0237 | 0.9886 ± 0.0014 | 0.9943 ± 0.0007 | 0.0006 |
| NCSN++ | 0.3905 ± 0.0069 | 0.5867 ± 0.0228 | 0.9888 ± 0.0013 | 0.9944 ± 0.0007 | 0.0005 |
| ADM | 0.3966 ± 0.0071 | 0.5960 ± 0.0227 | 0.9885 ± 0.0014 | 0.9942 ± 0.0007 | 0.0005 |
VAR_10V (10-meter V Wind)
| Architecture | MAE ↓ | RMSE ↓ | R² ↑ | Pearson ↑ | KL ↓ |
|---|---|---|---|---|---|
| DDPM++ | 0.3813 ± 0.0068 | 0.5701 ± 0.0240 | 0.9859 ± 0.0016 | 0.9929 ± 0.0008 | 0.0006 |
| NCSN++ | 0.3792 ± 0.0068 | 0.5663 ± 0.0237 | 0.9861 ± 0.0016 | 0.9930 ± 0.0008 | 0.0005 |
| ADM | 0.3844 ± 0.0066 | 0.5739 ± 0.0238 | 0.9857 ± 0.0016 | 0.9928 ± 0.0008 | 0.0005 |
VAR_TP (Total Precipitation)
| Architecture | MAE ↓ | NMAE ↓ | R² ↑ | Pearson ↑ | KL ↓ |
|---|---|---|---|---|---|
| DDPM++ | 0.0001 | 0.5058 ± 0.0145 | 0.7182 ± 0.0133 | 0.8478 ± 0.0079 | 0.1427 |
| NCSN++ | 0.0001 | 0.5105 ± 0.0155 | 0.7220 ± 0.0148 | 0.8497 ± 0.0088 | 0.1326 |
| ADM | 0.0001 | 0.5096 ± 0.0145 | 0.7176 ± 0.0133 | 0.8474 ± 0.0079 | 0.2088 |
Note: MAE values are on the normalized scale; precipitation shows the lowest absolute error because of the large number of zero values in the dataset.
VAR_D2M (2-meter Dewpoint Temperature)
| Architecture | MAE ↓ | RMSE ↓ | R² ↑ | Pearson ↑ | KL ↓ |
|---|---|---|---|---|---|
| DDPM++ | 0.4515 ± 0.0117 | 0.7178 ± 0.0221 | 0.9987 ± 0.0003 | 0.9994 ± 0.0002 | 0.0016 |
| NCSN++ | 0.4450 ± 0.0115 | 0.7104 ± 0.0222 | 0.9988 ± 0.0003 | 0.9994 ± 0.0002 | 0.0015 |
| ADM | 0.4551 ± 0.0117 | 0.7261 ± 0.0215 | 0.9987 ± 0.0003 | 0.9994 ± 0.0002 | 0.0017 |
VAR_ST (Skin Temperature)
| Architecture | MAE ↓ | RMSE ↓ | R² ↑ | Pearson ↑ | KL ↓ |
|---|---|---|---|---|---|
| DDPM++ | 0.4785 ± 0.0214 | 0.8947 ± 0.0485 | 0.9983 ± 0.0004 | 0.9991 ± 0.0002 | 0.0101 |
| NCSN++ | 0.4759 ± 0.0216 | 0.8862 ± 0.0485 | 0.9983 ± 0.0004 | 0.9991 ± 0.0002 | 0.0184 |
| ADM | 0.4859 ± 0.0215 | 0.9004 ± 0.0501 | 0.9982 ± 0.0004 | 0.9991 ± 0.0002 | 0.0116 |
Model Complexity Comparison
| Architecture | Parameters | Relative Size | Inference Characteristics |
|---|---|---|---|
| DDPM++ (SongUNet) | 54.4M | 1.0× | Lightweight, fast inference |
| NCSN++ (SongUNet) | 55.1M | 1.01× | Slightly larger, Fourier embeddings |
| ADM (DhariwalUNet) | 92.1M | 1.69× | Larger model, multi-resolution attention |
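The relative sizes in the table follow directly from the parameter counts listed under Model Architectures, with DDPM++ as the reference. A minimal sketch of that arithmetic (variable names are illustrative):

```python
# Parameter counts as reported for each architecture.
param_counts = {
    "DDPM++ (SongUNet)": 54_429_958,
    "NCSN++ (SongUNet)": 55_109_510,
    "ADM (DhariwalUNet)": 92_140_550,
}

# DDPM++ is the smallest model and serves as the 1.0x reference.
reference = param_counts["DDPM++ (SongUNet)"]

relative_size = {name: count / reference for name, count in param_counts.items()}
size_in_millions = {name: count / 1e6 for name, count in param_counts.items()}

for name in param_counts:
    print(f"{name}: {size_in_millions[name]:.1f}M, {relative_size[name]:.2f}x")
```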
Key Findings
Best Overall Performance: The NCSN++ architecture achieves the best value on five of the six overall metrics (DDPM++ has the lowest NMAE, 0.1170 vs 0.1176):
Lowest loss (0.0517 vs 0.0524 for DDPM++ and 0.0527 for ADM)
Lowest MAE (0.3432 vs 0.3458/0.3500)
Lowest RMSE (0.5552 vs 0.5604/0.5656)
Highest R² (0.9489 vs 0.9482/0.9480)
Highest Pearson correlation (0.9725 vs 0.9722/0.9721)
Best Performance for Precipitation: NCSN++ achieves the highest R² for VAR_TP (0.7220) and lowest KL divergence (0.1326), indicating better distribution matching.
Best Performance for Wind Fields: NCSN++ achieves the best MAE, RMSE, R², and Pearson correlation for both U and V wind components, and ties ADM for the lowest KL divergence (0.0005).
Most Challenging Variable: Precipitation (VAR_TP) shows the lowest R² scores (0.718-0.722) and highest NMAE (0.506-0.511), reflecting the difficulty of downscaling intermittent precipitation events.
Model Efficiency: The DDPM++ architecture has the fewest parameters (54.4M) while maintaining competitive performance, making it suitable for resource-constrained applications.
Wind Field Anisotropy: Performance is slightly better for U-wind (R² ~0.9888) than V-wind (R² ~0.9861), which may reflect the zonal dominance of atmospheric circulation.
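The ranking behind these findings can be reproduced from the overall-metrics table. The sketch below picks the best architecture per metric by comparing mean values only; it ignores the reported standard deviations, which are larger than most of the differences between models. The helper name `best_architecture` is illustrative.

```python
# Mean values copied from the overall-metrics table.
overall = {
    "DDPM++": {"Loss": 0.0524, "MAE": 0.3458, "NMAE": 0.1170,
               "RMSE": 0.5604, "R2": 0.9482, "Pearson": 0.9722},
    "NCSN++": {"Loss": 0.0517, "MAE": 0.3432, "NMAE": 0.1176,
               "RMSE": 0.5552, "R2": 0.9489, "Pearson": 0.9725},
    "ADM": {"Loss": 0.0527, "MAE": 0.3500, "NMAE": 0.1179,
            "RMSE": 0.5656, "R2": 0.9480, "Pearson": 0.9721},
}

# Metrics marked with an up arrow in the table; all others are lower-is-better.
HIGHER_IS_BETTER = {"R2", "Pearson"}


def best_architecture(metric):
    """Return the architecture with the best mean value for the given metric."""
    pick = max if metric in HIGHER_IS_BETTER else min
    return pick(overall, key=lambda name: overall[name][metric])


winners = {metric: best_architecture(metric) for metric in overall["DDPM++"]}

for metric, name in winners.items():
    print(f"{metric}: {name}")
```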
Recommendations
Based on the comprehensive comparison across 6 meteorological variables:
For maximum accuracy: Use NCSN++ (SongUNet with Fourier embeddings)
Best overall performance across nearly all metrics
Superior handling of precipitation distributions
Marginal parameter increase over DDPM++
For balanced performance: Use DDPM++ (SongUNet with positional embeddings)
Excellent performance with slightly fewer parameters
Competitive across all variables
Best for resource-constrained deployment
For temperature-sensitive applications: All three models perform excellently (R² ≥ 0.998 for 2-meter temperature, dewpoint temperature, and skin temperature), with minimal differences
For precipitation downscaling: NCSN++ is the recommended choice due to superior distribution matching and higher R²
For ensemble applications: Consider all three as they show complementary strengths across different variable types
Note on ADM Performance
While the ADM architecture achieves competitive performance, it underperforms both SongUNet variants on the overall metrics despite having nearly 1.7× as many parameters (92.1M vs 54.4M for DDPM++). This suggests that:
The SongUNet architecture is better suited for downscaling tasks
The simplified U-Net design with fewer attention layers may generalize better
The additional complexity of ADM does not translate to improved performance for this application