Benchmark
=========

This section documents the experimental results comparing three different U-Net architectures for downscaling within the IPSL-AID framework.

Model Performance Comparison
----------------------------

This presents a comprehensive comparison of three U-Net architectures trained for statistical downscaling of atmosphericvariables. All models were trained with identical hyperparameters.

Experiment Configuration
~~~~~~~~~~~~~~~~~~~~~~~~

All models were trained with the following common configuration:

- **Dataset**: ERA5 reanalysis data (2015-2019 training, 2020 validation)
- **Domain**: Global
- **Input resolution**: 721 × 1440 (0.25° grid)
- **Input channels**: 10 (6 target variables + 4 constants)
- **Output channels**: 6 (downscaled meteorological variables)
- **Target variables**:

  - VAR_2T: 2-meter temperature (K)
  - VAR_10U: 10-meter U wind component (m/s)
  - VAR_10V: 10-meter V wind component (m/s)
  - VAR_TP: Total precipitation (m/h)
  - VAR_D2M: 2-meter dewpoint temperature (K)
  - VAR_ST: Skin temperature (K)

- **Normalization**: Standard scaling (log1p for precipitation)
- **Time encoding**: Sine/cosine of day-of-year (4 channels)
- **Constant variables**: Orography (z) and land-sea mask (lsm)
- **Loss function**: UNet diffusion loss (MSE-based)
- **Learning rate**: 0.0001
- **Batch size**: 36 (12 spatial × 1460 temporal)
- **Epochs**: 20
- **Spatial batching**: 12 tiles
- **Temporal batching**: 1460 time steps

Model Architectures
~~~~~~~~~~~~~~~~~~~

Three U-Net variants were evaluated:

1. **DDPM++ (SongUNet - Positional embedding)**

   - Denoising Diffusion Probabilistic Model architecture
   - Positional timestep embedding
   - Standard encoder/decoder with skip connections
   - Channel multiplier: [2, 2, 2]
   - Base channels: 128
   - Resampling filter: [1, 1]
   - **Parameters**: 54,429,958

2. **NCSN++ (SongUNet - Fourier embedding)**

   - Noise-Conditioned Score Network architecture
   - Fourier feature timestep embedding
   - Residual encoder with skip connections
   - Channel multiplier: [2, 2, 2]
   - Base channels: 128
   - Resampling filter: [1, 3, 3, 1]
   - **Parameters**: 55,109,510

3. **ADM (DhariwalUNet)**

   - Ablated Diffusion Model architecture
   - Multi-resolution attention (32, 16, 8)
   - Channel multiplier: [1, 2, 3, 4]
   - Base channels: 128
   - Number of blocks: 2
   - **Parameters**: 92,140,550

Performance Metrics
~~~~~~~~~~~~~~~~~~~

The following metrics were used for evaluation on the validation set (year 2020):

- **Loss**: UNet diffusion loss value
- **MAE**: Mean Absolute Error (normalized scale)
- **NMAE**: Normalized Mean Absolute Error (normalized by variable range)
- **RMSE**: Root Mean Square Error (normalized scale)
- **R²**: Coefficient of determination
- **Pearson**: Pearson correlation coefficient
- **KL**: KL divergence (distribution similarity)

Quantitative Comparison - Overall Metrics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. list-table:: Performance comparison for all variables (validation set)
   :header-rows: 1
   :widths: 25, 15, 15, 15, 15, 15, 15
   :align: center

   * - Architecture
     - Loss ↓
     - MAE ↓
     - NMAE ↓
     - RMSE ↓
     - R² ↑
     - Pearson ↑
   * - DDPM++ (SongUNet)
     - 0.0524
     - 0.3458 ± 0.0069
     - 0.1170 ± 0.0035
     - 0.5604 ± 0.0164
     - 0.9482 ± 0.0025
     - 0.9722 ± 0.0014
   * - NCSN++ (SongUNet)
     - 0.0517
     - 0.3432 ± 0.0068
     - 0.1176 ± 0.0037
     - 0.5552 ± 0.0159
     - 0.9489 ± 0.0027
     - 0.9725 ± 0.0016
   * - ADM (DhariwalUNet)
     - 0.0527
     - 0.3500 ± 0.0071
     - 0.1179 ± 0.0035
     - 0.5656 ± 0.0166
     - 0.9480 ± 0.0023
     - 0.9721 ± 0.0013

*Note: ↓ indicates lower is better, ↑ indicates higher is better. Values show mean ± std across spatial batches.*

Baseline Comparison
~~~~~~~~~~~~~~~~~~~

For reference, coarse input (bilinear interpolation of low-resolution input) metrics are provided:

.. list-table:: Baseline coarse input performance (all variables)
   :header-rows: 1
   :widths: 20, 15, 15, 15, 15, 15
   :align: center

   * - Baseline
     - MAE
     - NMAE
     - RMSE
     - R²
     - Pearson
   * - Coarse Input
     - 0.6993 ± 0.0214
     - 0.1981 ± 0.0045
     - 1.2208 ± 0.0368
     - 0.8873 ± 0.0039
     - 0.9377 ± 0.0028

**Improvement over baseline**: All three U-Net architectures achieve significant improvements, reducing MAE by approximately 50% and increasing R² from 0.887 to 0.948+.

Per-Variable Performance
~~~~~~~~~~~~~~~~~~~~~~~~

VAR_2T (2-meter Temperature)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table:: VAR_2T performance comparison
   :header-rows: 1
   :widths: 25, 15, 15, 15, 15, 15
   :align: center

   * - Architecture
     - MAE ↓
     - RMSE ↓
     - R² ↑
     - Pearson ↑
     - KL ↓
   * - DDPM++
     - 0.3697 ± 0.0095
     - 0.5873 ± 0.0171
     - 0.9992 ± 0.0001
     - 0.9996 ± 0.0001
     - 0.0010
   * - NCSN++
     - 0.3684 ± 0.0097
     - 0.5816 ± 0.0168
     - 0.9992 ± 0.0001
     - 0.9996 ± 0.0001
     - 0.0011
   * - ADM
     - 0.3775 ± 0.0103
     - 0.5968 ± 0.0176
     - 0.9992 ± 0.0001
     - 0.9996 ± 0.0001
     - 0.0007

VAR_10U (10-meter U Wind)
^^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table:: VAR_10U performance comparison
   :header-rows: 1
   :widths: 25, 15, 15, 15, 15, 15
   :align: center

   * - Architecture
     - MAE ↓
     - RMSE ↓
     - R² ↑
     - Pearson ↑
     - KL ↓
   * - DDPM++
     - 0.3938 ± 0.0071
     - 0.5921 ± 0.0237
     - 0.9886 ± 0.0014
     - 0.9943 ± 0.0007
     - 0.0006
   * - NCSN++
     - 0.3905 ± 0.0069
     - 0.5867 ± 0.0228
     - 0.9888 ± 0.0013
     - 0.9944 ± 0.0007
     - 0.0005
   * - ADM
     - 0.3966 ± 0.0071
     - 0.5960 ± 0.0227
     - 0.9885 ± 0.0014
     - 0.9942 ± 0.0007
     - 0.0005

VAR_10V (10-meter V Wind)
^^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table:: VAR_10V performance comparison
   :header-rows: 1
   :widths: 25, 15, 15, 15, 15, 15
   :align: center

   * - Architecture
     - MAE ↓
     - RMSE ↓
     - R² ↑
     - Pearson ↑
     - KL ↓
   * - DDPM++
     - 0.3813 ± 0.0068
     - 0.5701 ± 0.0240
     - 0.9859 ± 0.0016
     - 0.9929 ± 0.0008
     - 0.0006
   * - NCSN++
     - 0.3792 ± 0.0068
     - 0.5663 ± 0.0237
     - 0.9861 ± 0.0016
     - 0.9930 ± 0.0008
     - 0.0005
   * - ADM
     - 0.3844 ± 0.0066
     - 0.5739 ± 0.0238
     - 0.9857 ± 0.0016
     - 0.9928 ± 0.0008
     - 0.0005

VAR_TP (Total Precipitation)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table:: VAR_TP performance comparison (most challenging variable)
   :header-rows: 1
   :widths: 25, 15, 15, 15, 15, 15
   :align: center

   * - Architecture
     - MAE ↓
     - NMAE ↓
     - R² ↑
     - Pearson ↑
     - KL ↓
   * - DDPM++
     - 0.0001
     - 0.5058 ± 0.0145
     - 0.7182 ± 0.0133
     - 0.8478 ± 0.0079
     - 0.1427
   * - NCSN++
     - 0.0001
     - 0.5105 ± 0.0155
     - 0.7220 ± 0.0148
     - 0.8497 ± 0.0088
     - 0.1326
   * - ADM
     - 0.0001
     - 0.5096 ± 0.0145
     - 0.7176 ± 0.0133
     - 0.8474 ± 0.0079
     - 0.2088

*Note: MAE values are in normalized scale; precipitation shows lowest absolute error due to large number of zero in the dataset.*

VAR_D2M (2-meter Dewpoint Temperature)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table:: VAR_D2M performance comparison
   :header-rows: 1
   :widths: 25, 15, 15, 15, 15, 15
   :align: center

   * - Architecture
     - MAE ↓
     - RMSE ↓
     - R² ↑
     - Pearson ↑
     - KL ↓
   * - DDPM++
     - 0.4515 ± 0.0117
     - 0.7178 ± 0.0221
     - 0.9987 ± 0.0003
     - 0.9994 ± 0.0002
     - 0.0016
   * - NCSN++
     - 0.4450 ± 0.0115
     - 0.7104 ± 0.0222
     - 0.9988 ± 0.0003
     - 0.9994 ± 0.0002
     - 0.0015
   * - ADM
     - 0.4551 ± 0.0117
     - 0.7261 ± 0.0215
     - 0.9987 ± 0.0003
     - 0.9994 ± 0.0002
     - 0.0017

VAR_ST (Skin Temperature)
^^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table:: VAR_ST performance comparison
   :header-rows: 1
   :widths: 25, 15, 15, 15, 15, 15
   :align: center

   * - Architecture
     - MAE ↓
     - RMSE ↓
     - R² ↑
     - Pearson ↑
     - KL ↓
   * - DDPM++
     - 0.4785 ± 0.0214
     - 0.8947 ± 0.0485
     - 0.9983 ± 0.0004
     - 0.9991 ± 0.0002
     - 0.0101
   * - NCSN++
     - 0.4759 ± 0.0216
     - 0.8862 ± 0.0485
     - 0.9983 ± 0.0004
     - 0.9991 ± 0.0002
     - 0.0184
   * - ADM
     - 0.4859 ± 0.0215
     - 0.9004 ± 0.0501
     - 0.9982 ± 0.0004
     - 0.9991 ± 0.0002
     - 0.0116

Model Complexity Comparison
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. list-table:: Model complexity and efficiency
   :header-rows: 1
   :widths: 25, 20, 20, 30
   :align: center

   * - Architecture
     - Parameters
     - Relative Size
     - Inference Characteristics
   * - DDPM++ (SongUNet)
     - 54.4M
     - 1.0×
     - Lightweight, fast inference
   * - NCSN++ (SongUNet)
     - 55.1M
     - 1.01×
     - Slightly larger, Fourier embeddings
   * - ADM (DhariwalUNet)
     - 92.1M
     - 1.69×
     - Larger model, multi-resolution attention

Key Findings
~~~~~~~~~~~~

**Best Overall Performance**: The **NCSN++** architecture achieves the best overall metrics:

- Lowest loss (0.0517 vs 0.0524 for DDPM++ and 0.0527 for ADM)
- Lowest MAE (0.3432 vs 0.3458/0.3500)
- Highest R² (0.9489 vs 0.9482/0.9480)
- Highest Pearson correlation (0.9725 vs 0.9722/0.9721)

**Best Performance for Precipitation**: **NCSN++** achieves the highest R² for VAR_TP (0.7220) and lowest KL divergence (0.1326), indicating better distribution matching.

**Best Performance for Wind Fields**: **NCSN++** consistently outperforms for both U and V wind components across all metrics.

**Most Challenging Variable**: Precipitation (VAR_TP) shows the lowest R² scores (0.718-0.722) and highest NMAE (0.506-0.511), reflecting the difficulty of downscaling intermittent precipitation events.

**Model Efficiency**: The **DDPM++** architecture has the fewest parameters (54.4M) while maintaining competitive performance, making it suitable for resource-constrained applications.

**Wind Field Anisotropy**: Performance is slightly better for U-wind (R² ~0.9888) than V-wind (R² ~0.9861), which may reflect the zonal dominance of atmospheric circulation.

Recommendations
~~~~~~~~~~~~~~~

Based on the comprehensive comparison across 6 meteorological variables:

1. **For maximum accuracy**: Use **NCSN++** (SongUNet with Fourier embeddings)

   - Best overall performance across nearly all metrics
   - Superior handling of precipitation distributions
   - Marginal parameter increase over DDPM++

2. **For balanced performance**: Use **DDPM++** (SongUNet with positional embeddings)

   - Excellent performance with slightly fewer parameters
   - Competitive across all variables
   - Best for resource-constrained deployment

3. **For temperature-sensitive applications**: All three models perform excellently (R² > 0.999), with minimal differences

4. **For precipitation downscaling**: **NCSN++** is the recommended choice due to superior distribution matching and higher R²

5. **For ensemble applications**: Consider all three as they show complementary strengths across different variable types

Note on ADM Performance
~~~~~~~~~~~~~~~~~~~~~~~

While the ADM architecture achieves competitive performance, it underperforms both SongUNet variants despite having nearly 1.7× more parameters. This suggests that:

- The SongUNet architecture is better suited for downscaling tasks
- The simplified U-Net design with fewer attention layers may generalize better
- The additional complexity of ADM does not translate to improved performance for this application