Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- XAI methods can explain the predictions of deep neural networks (DNNs).
- XAI methods have been applied in climate science.
- Missing ground-truth explanations complicate the evaluation and validation of XAI methods.
- This work introduces XAI evaluation in the context of climate research.
- Evaluation properties assessed: robustness, faithfulness, randomization, complexity, and localization.
- An MLP and a CNN are trained to predict the decade based on temperature maps.
- Multiple XAI methods are applied, and their performance is quantified for each evaluation property.
- The XAI methods Integrated Gradients, Layer-wise Relevance Propagation, and InputGradients show considerable robustness, faithfulness, and complexity.
- Explanations using input perturbations do not improve robustness and faithfulness.

Paper Content

Introduction
- Deep learning is used in climate science for tasks such as nowcasting, monitoring, forecasting, model enhancement, and upsampling of satellite data.
- Deep neural networks are considered a black box and lack transparency.
- Explainable artificial intelligence (XAI) can validate DNNs and provide researchers with new insights.
- XAI can be categorized using three aspects: local/global decision-making, self-explaining models, and model-aware/model-agnostic methods.
- The output of XAI methods can differ in meaning.
- XAI evaluation quantitatively assesses the reliability of an explanation.
- XAI evaluation properties include robustness, complexity, localization, randomization, and faithfulness.
- The workflow consists of training a model, applying XAI methods, and using XAI evaluation to compare and rank the methods.
- Evaluation metrics are assessed for compatibility with climate data properties.
- A guideline is established for using XAI evaluation to choose an optimal explanation.

Data and methods

Data
- The data is simulated by the general climate model CESM1.
- The data consists of 40 ensemble members.
- The data is global 2-m air temperature maps from 1920 to 2080.
- The data is processed by computing annual averages and applying a bilinear interpolation.
- The data is standardized by removing the multi-year 1920-2080 mean and dividing by the corresponding standard deviation.

Networks
- An MLP and a CNN are trained to solve a fuzzy classification problem.
- The MLP takes flattened temperature maps as input.
- The MLP assigns each map to one of 20 different classes.
- Regression is used to predict the year of the input.
- The MLP and the CNN have a comparable number of parameters.
- The datasets include a training and a test set; 80% of the data is split into a training and a validation set.

Explainable artificial intelligence (XAI)
- Model-aware explanation methods used in climate science are presented.
- Model-agnostic explanation methods are not considered because of their computational time.
- Gradient/Saliency explains the network decision by computing the first partial derivative of the output with respect to the input.
- InputGradient extends the information content towards the input image by multiplying the gradient with the input.
- Integrated Gradients introduces a baseline datapoint and computes the explanation based on the difference to that baseline.
- Layer-wise Relevance Propagation (LRP) computes the relevance of each input feature by feeding the network's prediction backwards through the network.
- SmoothGrad, NoiseGrad, and FusionGrad perturb the input features and/or the network weights to account for uncertainties.

Evaluation techniques
- XAI research has developed metrics to assess different properties of explanation methods.
- Five different evaluation properties are analyzed, based on a classification task from Labe and Barnes [2021].

Faithfulness
- Table 5 refers to a perturbation function called "Indices".
- "Indices" refers to the replacement of the highest-value pixels in the explanation.
- "Linear" refers to noisy linear imputation.

Randomisation
- Calculations of the MPT score use a "bottom up" approach, from the output layer to the input layer.
- Pearson correlation is used as the similarity function for both metrics.
- Top-k considers the 10% most relevant pixels of all pixels in the temperature map.
- Hyperparameters of the XAI methods and of the evaluation metrics are reported in Tables 4 and 5, respectively.
- The maximum and minimum values of the temperature maps in the dataset are denoted as x_max and x_min.
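The gradient-based explanations summarized under "Explainable artificial intelligence (xai)" above can be illustrated with a minimal sketch. It is not the paper's implementation: the single tanh unit, the weights `w`, the input `x`, the all-zero `baseline`, and the step count are all hypothetical choices made only so the example runs without a deep learning library. It shows Gradient/Saliency, InputGradient (gradient times input), and a Riemann-sum approximation of Integrated Gradients.

```python
import math

def model(x, w):
    """Toy differentiable 'network': one tanh unit (hypothetical stand-in)."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)))

def gradient(x, w):
    """Gradient/Saliency: partial derivative of the output w.r.t. each input."""
    out = model(x, w)
    return [(1.0 - out ** 2) * wi for wi in w]

def input_gradient(x, w):
    """InputGradient: gradient multiplied element-wise by the input."""
    return [xi * gi for xi, gi in zip(x, gradient(x, w))]

def integrated_gradients(x, w, baseline, steps=200):
    """Integrated Gradients, approximated by a midpoint Riemann sum
    along the straight path from the baseline to the input."""
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point, w)):
            avg_grad[i] += g / steps
    # Scale the averaged gradient by the difference to the baseline.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]

# Assumed toy values, not taken from the paper.
w = [0.5, -1.0, 0.25]
x = [1.0, 0.2, -0.4]
baseline = [0.0, 0.0, 0.0]

saliency = gradient(x, w)
ixg = input_gradient(x, w)
ig = integrated_gradients(x, w, baseline)
```

A useful sanity check for Integrated Gradients is completeness: `sum(ig)` should be close to `model(x, w) - model(baseline, w)`, up to the error of the numerical integration.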
Localisation
- The quality of an explanation is measured by its agreement with a user-defined region of interest (ROI).
- Localization metrics assume that the ROI should be mainly responsible for the network decision.
- Top-k-pixel and relevance-rank-accuracy are used to measure localization.

Complexity
- Complexity assesses how evidence values are distributed across the explanation map.
- Complexity is a measure of conciseness.
- Explanations should consist of few strong features.
- Complexity and sparseness are used as metric functions.
- Low entropy is desirable.

Network predictions, explanations and motivating example
- Network performance is evaluated, and the application of explanation methods is discussed for both network architectures.
- Hyperparameters and the fuzzy classification setup are fixed for the MLP and the CNN during training.
- The MLP and the CNN perform similarly to the primary publication.
- The classification accuracy of both networks agrees within error bounds.
- Explanation maps are calculated for all correctly predicted temperature maps.
- XAI methods are applied to explain the predictions of both the MLP and the CNN.
- Different XAI methods provide different relevances.

Assessment of XAI metrics
- The XAI evaluation properties are evaluated for the classification task on the MLP.
- Two representative metrics are analyzed for each property.
- The analysis is based on three criteria: coherence, score stability, and information value.
- An artificial random explanation baseline is provided for each metric.
- The robustness metrics pass the random baseline test.
- LRP-α-β has the highest robustness scores.
- FusionGrad and NoiseGrad have the lowest robustness scores.
- The AS and LLE scores do not align.
- FC passes the random baseline test, whereas the ROAD scores of NoiseGrad and FusionGrad overlap with the random baseline.
- For the MPT and RL metrics, the random baseline has the lowest scores.
- For the Complexity and Sparseness metrics, LRP-α-β has the highest complexity score, and InputGradients and LRP-z have the highest sparseness scores.
- For the localization metrics, FusionGrad has the highest score; all other explanation methods have lower but similar scores.
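The localization and complexity properties described above can be made concrete with a small sketch. It is an assumption-laden illustration, not the metric implementations used in the paper: the function names, the flattened 20-pixel maps, and the ROI mask are hypothetical. The first function scores the fraction of the top-k most relevant pixels that fall inside the ROI; the second computes the Shannon entropy of the normalized absolute relevances, where lower entropy means a more concise explanation.

```python
import math

def top_k_localization(explanation, roi_mask, k):
    """Fraction of the k highest-relevance pixels that lie inside the ROI."""
    order = sorted(range(len(explanation)),
                   key=lambda i: explanation[i], reverse=True)
    return sum(1 for i in order[:k] if roi_mask[i]) / k

def complexity_entropy(explanation):
    """Shannon entropy of normalized absolute relevances (lower = more concise)."""
    mags = [abs(v) for v in explanation]
    total = sum(mags)
    probs = [m / total for m in mags if m > 0]
    return -sum(p * math.log(p) for p in probs)

# Hypothetical flattened maps with 20 pixels each.
n_pixels = 20
concise = [0.0] * n_pixels
concise[3], concise[7] = 0.9, 0.1          # few strong features
diffuse = [0.05] * n_pixels                # evidence spread evenly

# Hypothetical ROI covering pixels 2..7.
roi_mask = [2 <= i <= 7 for i in range(n_pixels)]

k = max(1, round(0.10 * n_pixels))         # 10% most relevant pixels
loc_score = top_k_localization(concise, roi_mask, k)
entropy_concise = complexity_entropy(concise)
entropy_diffuse = complexity_entropy(diffuse)
```

In this toy setup both top-k pixels of the concise map lie inside the ROI, and its entropy is lower than that of the evenly spread map, which matches the statement that low entropy is desirable.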
Network-based comparison
- The MLP and CNN networks are compared using one metric per property.
- Defining a meaningful ROI for localization, and defining localization as an explanation property, are both challenging.
- Table 3 displays the results for both networks across all properties.
- Rankings are similar across categories, with differences in localization and complexity due to structural differences in the learned patterns.
- Input contribution methods (Integrated Gradients, InputGradients, LRP) perform best in faithfulness, robustness, and complexity.
- Gradient-based methods (Gradient, SmoothGrad, NoiseGrad, FusionGrad) perform best in randomization.
- LRP-α-β and LRP-composite rank low in the faithfulness category.
- Explanation-enhancing procedures (SmoothGrad, Integrated Gradients, FusionGrad, NoiseGrad) show no improvement in explanation performance.
- A spider plot (Table 3 and Figure 8) is used to determine the best-performing XAI method.

Choosing an XAI method
- XAI evaluation can be used to select an appropriate XAI method....
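Comparing and ranking methods across evaluation properties, as summarized above, can be sketched as a per-property ranking followed by a mean rank. This is only one plausible aggregation and not necessarily the paper's procedure. The method names, the scores, and the direction of each metric (whether higher or lower is better) are placeholders invented for the example and are not results from the paper.

```python
def rank_methods(scores, higher_is_better=True):
    """Map each method to its rank (1 = best) for a single property."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    return {method: rank + 1 for rank, method in enumerate(ordered)}

def mean_ranks(ranks_per_property):
    """Average each method's rank over all properties."""
    methods = next(iter(ranks_per_property.values())).keys()
    n = len(ranks_per_property)
    return {m: sum(r[m] for r in ranks_per_property.values()) / n
            for m in methods}

# Placeholder scores: (scores per method, higher_is_better flag).
property_scores = {
    "robustness":    ({"method_A": 0.2, "method_B": 0.5, "method_C": 0.9}, False),
    "faithfulness":  ({"method_A": 0.7, "method_B": 0.4, "method_C": 0.1}, True),
    "randomization": ({"method_A": 0.3, "method_B": 0.6, "method_C": 0.8}, True),
}

ranks = {prop: rank_methods(scores, hib)
         for prop, (scores, hib) in property_scores.items()}
overall = mean_ranks(ranks)
best_method = min(overall, key=overall.get)
```

The per-property ranks in `ranks` are the values one would plot on the axes of a spider plot; the mean rank is only a convenience summary and hides trade-offs between properties.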