Benchmark

This section documents the experimental results comparing three different U-Net architectures for downscaling within the IPSL-AID framework.

Model Performance Comparison

The comparison covers three U-Net architectures trained for statistical downscaling of atmospheric variables. All models were trained with identical hyperparameters.

Experiment Configuration

All models were trained with the following common configuration:

  • Dataset: ERA5 reanalysis data (2015-2019 training, 2020 validation)

  • Domain: Global

  • Input resolution: 721 × 1440 (0.25° grid)

  • Input channels: 10 (6 target variables + 4 constants)

  • Output channels: 6 (downscaled meteorological variables)

  • Target variables:

    • VAR_2T: 2-meter temperature (K)

    • VAR_10U: 10-meter U wind component (m/s)

    • VAR_10V: 10-meter V wind component (m/s)

    • VAR_TP: Total precipitation (m/h)

    • VAR_D2M: 2-meter dewpoint temperature (K)

    • VAR_ST: Skin temperature (K)

  • Normalization: Standard scaling (log1p for precipitation)

  • Time encoding: Sine/cosine of day-of-year (4 channels)

  • Constant variables: Orography (z) and land-sea mask (lsm)

  • Loss function: UNet diffusion loss (MSE-based)

  • Learning rate: 0.0001

  • Batch size: 36 (12 spatial × 1460 temporal)

  • Epochs: 20

  • Spatial batching: 12 tiles

  • Temporal batching: 1460 time steps
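The normalization step listed above (standard scaling, with log1p applied to precipitation first) can be sketched as follows. This is a minimal illustration on synthetic arrays, not the framework's own code: the function names, the per-variable scalar statistics, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def fit_standard_scaler(x, log1p=False):
    """Return (mean, std) of x, computed after log1p if requested."""
    x = np.log1p(x) if log1p else x
    return float(x.mean()), float(x.std())

def normalize(x, mean, std, log1p=False):
    """Standard scaling; precipitation is passed through log1p first."""
    x = np.log1p(x) if log1p else x
    return (x - mean) / std

def denormalize(z, mean, std, log1p=False):
    """Invert normalize(): undo the scaling, then undo log1p with expm1."""
    x = z * std + mean
    return np.expm1(x) if log1p else x

# Synthetic stand-ins for two variables (hypothetical values, not ERA5 data).
rng = np.random.default_rng(0)
t2m = rng.normal(285.0, 10.0, size=(4, 8))   # temperature-like field (K)
tp = rng.exponential(1e-4, size=(4, 8))      # precipitation-like field
tp[tp < 5e-5] = 0.0                          # precipitation has many exact zeros

t_mean, t_std = fit_standard_scaler(t2m)
p_mean, p_std = fit_standard_scaler(tp, log1p=True)

t_norm = normalize(t2m, t_mean, t_std)
p_norm = normalize(tp, p_mean, p_std, log1p=True)

t_back = denormalize(t_norm, t_mean, t_std)
p_back = denormalize(p_norm, p_mean, p_std, log1p=True)
```

The log1p transform compresses the long right tail of precipitation while mapping zeros to zero, which is why it is commonly paired with standard scaling for this variable.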

Model Architectures

Three U-Net variants were evaluated:

  1. DDPM++ (SongUNet - Positional embedding)

    • Denoising Diffusion Probabilistic Model architecture

    • Positional timestep embedding

    • Standard encoder/decoder with skip connections

    • Channel multiplier: [2, 2, 2]

    • Base channels: 128

    • Resampling filter: [1, 1]

    • Parameters: 54,429,958

  2. NCSN++ (SongUNet - Fourier embedding)

    • Noise-Conditioned Score Network architecture

    • Fourier feature timestep embedding

    • Residual encoder with skip connections

    • Channel multiplier: [2, 2, 2]

    • Base channels: 128

    • Resampling filter: [1, 3, 3, 1]

    • Parameters: 55,109,510

  3. ADM (DhariwalUNet)

    • Ablated Diffusion Model architecture

    • Multi-resolution attention (32, 16, 8)

    • Channel multiplier: [1, 2, 3, 4]

    • Base channels: 128

    • Number of blocks: 2

    • Parameters: 92,140,550
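The two SongUNet variants differ in how the diffusion timestep (or noise level) is embedded. The NumPy sketch below illustrates the general idea of a positional (sinusoidal) embedding versus a random Fourier feature embedding. It is an illustration of the two techniques rather than the framework's implementation; the values of `max_positions`, `scale`, and `seed` are assumptions chosen for the example.

```python
import numpy as np

def positional_embedding(t, num_channels=128, max_positions=10000):
    """Sinusoidal embedding with deterministic, geometrically spaced frequencies."""
    half = num_channels // 2
    freqs = (1.0 / max_positions) ** (np.arange(half) / half)
    angles = np.outer(t, freqs)                       # shape (len(t), half)
    return np.concatenate([np.cos(angles), np.sin(angles)], axis=1)

def fourier_embedding(t, num_channels=128, scale=16.0, seed=0):
    """Random Fourier features: frequencies drawn from a Gaussian.

    In a real network the frequencies are sampled once and stored as a fixed
    buffer; the fixed seed here mimics that behaviour.
    """
    rng = np.random.default_rng(seed)
    freqs = rng.normal(size=num_channels // 2) * scale
    angles = np.outer(t, 2.0 * np.pi * freqs)         # shape (len(t), half)
    return np.concatenate([np.cos(angles), np.sin(angles)], axis=1)

t = np.array([0.1, 0.5, 1.0])      # three example noise levels
pos = positional_embedding(t)
fou = fourier_embedding(t)
print(pos.shape, fou.shape)
```

Both functions return one vector of `num_channels` features per input value; only the choice of frequencies differs.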

Performance Metrics

The following metrics were used for evaluation on the validation set (year 2020):

  • Loss: UNet diffusion loss value

  • MAE: Mean Absolute Error (normalized scale)

  • NMAE: Normalized Mean Absolute Error (normalized by variable range)

  • RMSE: Root Mean Square Error (normalized scale)

  • R²: Coefficient of determination

  • Pearson: Pearson correlation coefficient

  • KL: KL divergence (distribution similarity)
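As a reference for how these metrics are commonly defined, the following NumPy sketch computes each of them for a pair of arrays. The exact definitions used in the framework are not given here, so several choices are assumptions: NMAE divides MAE by the range of the reference field, and the KL divergence is estimated from histograms with a small constant added to avoid empty bins.

```python
import numpy as np

def mae(y, p):
    return float(np.mean(np.abs(p - y)))

def nmae(y, p):
    # Assumption: "normalized by variable range" means max - min of the reference.
    return mae(y, p) / float(y.max() - y.min())

def rmse(y, p):
    return float(np.sqrt(np.mean((p - y) ** 2)))

def r2(y, p):
    ss_res = np.sum((y - p) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def pearson(y, p):
    return float(np.corrcoef(y.ravel(), p.ravel())[0, 1])

def kl_divergence(y, p, bins=50, eps=1e-10):
    # Assumption: histogram-based estimate of KL(reference || prediction).
    lo, hi = min(y.min(), p.min()), max(y.max(), p.max())
    h_y, _ = np.histogram(y, bins=bins, range=(lo, hi))
    h_p, _ = np.histogram(p, bins=bins, range=(lo, hi))
    q_y = h_y / h_y.sum() + eps
    q_p = h_p / h_p.sum() + eps
    return float(np.sum(q_y * np.log(q_y / q_p)))

# Synthetic example: a reference field and a noisy prediction of it.
rng = np.random.default_rng(0)
y = rng.normal(size=(16, 16))
p = y + 0.1 * rng.normal(size=(16, 16))

for name, fn in [("MAE", mae), ("NMAE", nmae), ("RMSE", rmse),
                 ("R2", r2), ("Pearson", pearson), ("KL", kl_divergence)]:
    print(f"{name}: {fn(y, p):.4f}")
```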

Quantitative Comparison - Overall Metrics

Performance comparison for all variables (validation set)

Architecture         Loss ↓   MAE ↓             NMAE ↓            RMSE ↓            R² ↑              Pearson ↑
DDPM++ (SongUNet)    0.0524   0.3458 ± 0.0069   0.1170 ± 0.0035   0.5604 ± 0.0164   0.9482 ± 0.0025   0.9722 ± 0.0014
NCSN++ (SongUNet)    0.0517   0.3432 ± 0.0068   0.1176 ± 0.0037   0.5552 ± 0.0159   0.9489 ± 0.0027   0.9725 ± 0.0016
ADM (DhariwalUNet)   0.0527   0.3500 ± 0.0071   0.1179 ± 0.0035   0.5656 ± 0.0166   0.9480 ± 0.0023   0.9721 ± 0.0013

Note: ↓ indicates lower is better, ↑ indicates higher is better. Values show mean ± std across spatial batches.

Baseline Comparison

For reference, metrics for the coarse input (bilinear interpolation of the low-resolution input) are provided:

Baseline coarse input performance (all variables)

Baseline       MAE               NMAE              RMSE              R²                Pearson
Coarse Input   0.6993 ± 0.0214   0.1981 ± 0.0045   1.2208 ± 0.0368   0.8873 ± 0.0039   0.9377 ± 0.0028

Improvement over baseline: All three U-Net architectures improve substantially on the coarse input, reducing MAE by approximately 50% (from 0.699 to 0.343-0.350) and increasing R² from 0.887 to at least 0.948.
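The relative improvement quoted above can be reproduced from the table values with a few lines of Python. The numbers are copied from the overall and baseline tables; the dictionary layout and function name are only illustrative.

```python
# Mean values from the overall and baseline tables (normalized scale).
baseline = {"MAE": 0.6993, "RMSE": 1.2208, "R2": 0.8873}
models = {
    "DDPM++": {"MAE": 0.3458, "RMSE": 0.5604, "R2": 0.9482},
    "NCSN++": {"MAE": 0.3432, "RMSE": 0.5552, "R2": 0.9489},
    "ADM":    {"MAE": 0.3500, "RMSE": 0.5656, "R2": 0.9480},
}

def relative_reduction(base, value):
    """Percentage by which `value` is lower than `base`."""
    return 100.0 * (base - value) / base

mae_reduction = {
    name: relative_reduction(baseline["MAE"], m["MAE"]) for name, m in models.items()
}
rmse_reduction = {
    name: relative_reduction(baseline["RMSE"], m["RMSE"]) for name, m in models.items()
}
r2_gain = {name: m["R2"] - baseline["R2"] for name, m in models.items()}

for name in models:
    print(f"{name}: MAE -{mae_reduction[name]:.1f}%, "
          f"RMSE -{rmse_reduction[name]:.1f}%, R2 +{r2_gain[name]:.3f}")
```

For all three architectures the MAE reduction comes out close to 50% and the RMSE reduction slightly above that.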

Per-Variable Performance

VAR_2T (2-meter Temperature)

VAR_2T performance comparison

Architecture   MAE ↓             RMSE ↓            R² ↑              Pearson ↑         KL ↓
DDPM++         0.3697 ± 0.0095   0.5873 ± 0.0171   0.9992 ± 0.0001   0.9996 ± 0.0001   0.0010
NCSN++         0.3684 ± 0.0097   0.5816 ± 0.0168   0.9992 ± 0.0001   0.9996 ± 0.0001   0.0011
ADM            0.3775 ± 0.0103   0.5968 ± 0.0176   0.9992 ± 0.0001   0.9996 ± 0.0001   0.0007

VAR_10U (10-meter U Wind)

VAR_10U performance comparison

Architecture   MAE ↓             RMSE ↓            R² ↑              Pearson ↑         KL ↓
DDPM++         0.3938 ± 0.0071   0.5921 ± 0.0237   0.9886 ± 0.0014   0.9943 ± 0.0007   0.0006
NCSN++         0.3905 ± 0.0069   0.5867 ± 0.0228   0.9888 ± 0.0013   0.9944 ± 0.0007   0.0005
ADM            0.3966 ± 0.0071   0.5960 ± 0.0227   0.9885 ± 0.0014   0.9942 ± 0.0007   0.0005

VAR_10V (10-meter V Wind)

VAR_10V performance comparison

Architecture   MAE ↓             RMSE ↓            R² ↑              Pearson ↑         KL ↓
DDPM++         0.3813 ± 0.0068   0.5701 ± 0.0240   0.9859 ± 0.0016   0.9929 ± 0.0008   0.0006
NCSN++         0.3792 ± 0.0068   0.5663 ± 0.0237   0.9861 ± 0.0016   0.9930 ± 0.0008   0.0005
ADM            0.3844 ± 0.0066   0.5739 ± 0.0238   0.9857 ± 0.0016   0.9928 ± 0.0008   0.0005

VAR_TP (Total Precipitation)

VAR_TP performance comparison (most challenging variable)

Architecture   MAE ↓    NMAE ↓            R² ↑              Pearson ↑         KL ↓
DDPM++         0.0001   0.5058 ± 0.0145   0.7182 ± 0.0133   0.8478 ± 0.0079   0.1427
NCSN++         0.0001   0.5105 ± 0.0155   0.7220 ± 0.0148   0.8497 ± 0.0088   0.1326
ADM            0.0001   0.5096 ± 0.0145   0.7176 ± 0.0133   0.8474 ± 0.0079   0.2088

Note: MAE values are in normalized scale; precipitation shows the lowest absolute error because of the large number of zero values in the dataset.

VAR_D2M (2-meter Dewpoint Temperature)

VAR_D2M performance comparison

Architecture   MAE ↓             RMSE ↓            R² ↑              Pearson ↑         KL ↓
DDPM++         0.4515 ± 0.0117   0.7178 ± 0.0221   0.9987 ± 0.0003   0.9994 ± 0.0002   0.0016
NCSN++         0.4450 ± 0.0115   0.7104 ± 0.0222   0.9988 ± 0.0003   0.9994 ± 0.0002   0.0015
ADM            0.4551 ± 0.0117   0.7261 ± 0.0215   0.9987 ± 0.0003   0.9994 ± 0.0002   0.0017

VAR_ST (Skin Temperature)

VAR_ST performance comparison

Architecture   MAE ↓             RMSE ↓            R² ↑              Pearson ↑         KL ↓
DDPM++         0.4785 ± 0.0214   0.8947 ± 0.0485   0.9983 ± 0.0004   0.9991 ± 0.0002   0.0101
NCSN++         0.4759 ± 0.0216   0.8862 ± 0.0485   0.9983 ± 0.0004   0.9991 ± 0.0002   0.0184
ADM            0.4859 ± 0.0215   0.9004 ± 0.0501   0.9982 ± 0.0004   0.9991 ± 0.0002   0.0116

Model Complexity Comparison

Model complexity and efficiency

Architecture         Parameters   Relative Size   Inference Characteristics
DDPM++ (SongUNet)    54.4M        1.0×            Lightweight, fast inference
NCSN++ (SongUNet)    55.1M        1.01×           Slightly larger, Fourier embeddings
ADM (DhariwalUNet)   92.1M        1.69×           Larger model, multi-resolution attention
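The relative sizes in this table follow directly from the parameter counts listed under Model Architectures, with DDPM++ as the reference. A short check, using only those counts:

```python
# Parameter counts as listed for each architecture.
params = {
    "DDPM++ (SongUNet)": 54_429_958,
    "NCSN++ (SongUNet)": 55_109_510,
    "ADM (DhariwalUNet)": 92_140_550,
}

reference = params["DDPM++ (SongUNet)"]
relative = {name: count / reference for name, count in params.items()}

for name, count in params.items():
    print(f"{name}: {count / 1e6:.1f}M parameters, {relative[name]:.2f}x")
```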

Key Findings

Best Overall Performance: The NCSN++ architecture achieves the best overall metrics:

  • Lowest loss (0.0517 vs 0.0524 for DDPM++ and 0.0527 for ADM)

  • Lowest MAE (0.3432 vs 0.3458/0.3500)

  • Highest R² (0.9489 vs 0.9482/0.9480)

  • Highest Pearson correlation (0.9725 vs 0.9722/0.9721)
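The ranking above can be checked mechanically against the overall metrics table. The sketch below picks the best architecture per metric from the reported means; the data structure and helper name are illustrative, and the ± spread across spatial batches is not taken into account.

```python
# Mean values from the overall metrics table (validation set).
overall = {
    "DDPM++": {"Loss": 0.0524, "MAE": 0.3458, "NMAE": 0.1170,
               "RMSE": 0.5604, "R2": 0.9482, "Pearson": 0.9722},
    "NCSN++": {"Loss": 0.0517, "MAE": 0.3432, "NMAE": 0.1176,
               "RMSE": 0.5552, "R2": 0.9489, "Pearson": 0.9725},
    "ADM":    {"Loss": 0.0527, "MAE": 0.3500, "NMAE": 0.1179,
               "RMSE": 0.5656, "R2": 0.9480, "Pearson": 0.9721},
}
higher_is_better = {"R2", "Pearson"}

def best_architecture(metric):
    """Return the architecture with the best mean value for `metric`."""
    pick = max if metric in higher_is_better else min
    return pick(overall, key=lambda name: overall[name][metric])

best = {metric: best_architecture(metric) for metric in overall["DDPM++"]}
for metric, name in best.items():
    print(f"{metric}: {name}")
```

On these means NCSN++ is best on every overall metric except NMAE, where DDPM++ is marginally lower (0.1170 vs 0.1176).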

Best Performance for Precipitation: NCSN++ achieves the highest R² for VAR_TP (0.7220) and lowest KL divergence (0.1326), indicating better distribution matching.

Best Performance for Wind Fields: NCSN++ matches or outperforms the other two architectures for both U and V wind components on every reported metric, tying with ADM only on KL divergence (0.0005).

Most Challenging Variable: Precipitation (VAR_TP) shows the lowest R² scores (0.718-0.722) and highest NMAE (0.506-0.511), reflecting the difficulty of downscaling intermittent precipitation events.

Model Efficiency: The DDPM++ architecture has the fewest parameters (54.4M) while maintaining competitive performance, making it suitable for resource-constrained applications.

Wind Field Anisotropy: Performance is slightly better for U-wind (R² ~0.9888) than V-wind (R² ~0.9861), which may reflect the zonal dominance of atmospheric circulation.

Recommendations

Based on the comprehensive comparison across 6 meteorological variables:

  1. For maximum accuracy: Use NCSN++ (SongUNet with Fourier embeddings)

    • Best overall performance across nearly all metrics

    • Superior handling of precipitation distributions

    • Marginal parameter increase over DDPM++

  2. For balanced performance: Use DDPM++ (SongUNet with positional embeddings)

    • Excellent performance with slightly fewer parameters

    • Competitive across all variables

    • Best for resource-constrained deployment

  3. For temperature-sensitive applications: All three models perform excellently (R² > 0.998 for VAR_2T, VAR_D2M, and VAR_ST), with minimal differences

  4. For precipitation downscaling: NCSN++ is the recommended choice due to superior distribution matching and higher R²

  5. For ensemble applications: Consider all three as they show complementary strengths across different variable types

Note on ADM Performance

While the ADM architecture achieves competitive performance, it underperforms both SongUNet variants on the overall metrics despite having nearly 1.7× more parameters. This suggests that:

  • The SongUNet architecture is better suited for downscaling tasks

  • The simplified U-Net design with fewer attention layers may generalize better

  • The additional complexity of ADM does not translate to improved performance for this application