Link to paper

The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract

Explanations of AI models must be both human-intelligible and consistent with the model’s internal structure. The theory of causal abstraction provides the mathematical foundations for these explanations. Contributions include generalizing causal abstraction to cyclic structures, using multi-source interventions, defining approximate causal abstraction, and formalizing XAI methods.

Paper Content

Introduction

- XAI seeks to explain why deep learning models make the predictions they do
- Causal analysis is the gold standard for explaining model behavior and internal reasoning
- Low-level causal explanations of behavior and internal reasoning can be easily provided, but are not interpretable to humans
- High-level explanations are easier to interpret, but difficult to trust
- Causal abstraction provides a framework for analyzing a system at multiple levels of detail simultaneously
- Causal abstraction has been applied to deep learning AI models, weather patterns, and human brains
- This paper develops the theory of causal abstraction as a mathematical framework for XAI
- Low-level variables are partitioned into clusters, each associated with a high-level variable
- Approximate causal abstraction is explored, connecting interchange intervention analysis with existing definitions

Faithful and interpretable causal explanations of AI

- Causal explanations are privileged when explaining how an artifact works
- Causal explanations allow for manipulation and control of the system
- Appropriate level of abstraction is important for causal explanations
- Intervention is a fundamental operation of causal explanations
- Causal abstraction supports interpretable explanations of AI
- Faithfulness is defined as the degree to which an explanation accurately represents the ‘true reasoning process behind a model’s behavior’

Methods for explaining AI behavior

- AI model behavior is a function from inputs to outputs
- Behavior can be represented by a two-variable causal model
- XAI methods learn interpretable models
to approximate uninterpretable models
- XAI methods are model-agnostic and provide the same explanations for models with the same behavior
- Notions of faithfulness need to be grounded in causality to compare XAI methods

Methods for explaining the internal structure of AI

- AI models have internal reasoning that can be represented as a program or algorithm
- Recent research aims to understand the causal mechanisms inside black box models
- Causal abstraction provides a mathematical foundation for understanding the high-level semantics of neural representations
- Interchange interventions are used to show that neural representations represent propositional content
- Iterative nullspace projection is used to evaluate whether neural representations encode concepts with ‘mental’ causes and effects
- Causal mediation analysis is used to analyze gender bias in pretrained language models
- Circuit-based explanations reverse engineer the mechanisms of a network at the level of individual neurons
- Probing is used to determine whether a concept is present in a neural representation
- Feature attribution methods ascribe scores to neural representations to capture their ‘impact’ on model behavior

Causal models

- Notation: V denotes a set of variables, X denotes a variable, x denotes a value, Val(X) denotes the range of possible values for X
- No two variables can take on the same value, i.e. Val(X) and Val(Y) are disjoint whenever X ≠ Y
- Capital letters denote variables, lower case letters denote values, bold letters denote sets of variables/values
- Domain(f), Uniform(X), and the indicator 1[ϕ] are useful constructs
- Projection: given a partial setting u for a set of variables U, Proj(u, X) is the restriction of u to the variables in X
- Definition 4: a causal model is a pair (V, F) where V is a set of variables and F is a set of structural functions, one f_X for each X ∈ V
- Remark 5: no explicit reference to a graphical structure defining a causal ordering on the variables
- Remark 6: acyclic model notation
- Definition 7: the set of solutions is the set of all v ∈ Val(V) such that the equation Proj(v, X) = f_X(v) is satisfied for every X ∈ V
- Definition 8: an intervention is a partial setting i ∈ Val(I) for I ⊆ V; M_i is just like M except that f_X is replaced with the constant function v ↦ Proj(i, X) for each X ∈ I

Example of causal models: a symbolic algorithm and neural network

- Two causal models are defined to demonstrate the potential to model a variety of computational processes
- The first model is a tree-structured algorithm
- The second model is a fully-connected feed-forward neural network
- Both models solve the same task

Hierarchical equality task

- The hierarchical equality task is to determine if two pairs of objects have identical relations
- Input is two pairs of objects; output is True if both pairs are equal or both pairs are unequal, False otherwise
- Domain of objects consists of triangle, square, and pentagon
- An obvious tree-structured symbolic algorithm solves the task
- Equality reasoning is ubiquitous and has been studied for broader questions about relational reasoning
- Hierarchical equality serves as a case study for explaining how abstract tree-structured composition can be implemented by a fully-connected neural network
- A neural network is trained to implement the hierarchical equality task

A tree-structured algorithm for hierarchical equality

- Algorithm A consists of four input variables, two intermediate variables, and one output variable (seven in total, matching the total settings listed here)
- Acyclic causal graph is depicted in Figure 1a
- Each f_{X_i} is a constant function
- Default total setting is [ , , , , True, True, True]
- Counterfactual result is [ , , △, , True, True, True]

A fully connected neural network for hierarchical equality

- Neural network N consists of 8 input neurons
- Values for each variable are real numbers (ℝ)
- 4 sets of variables for first 4 layers
- Constant function f_{R_k} for 1 ≤ k ≤ 8
- Output neurons determined by network weights
- Network outputs True/False based on output logit values

Causal abstraction and interchange intervention analysis

- Structural conditions must be in place for H to be a high-level abstraction of the low-level model L
- N and A from the previous section are reused as the running example

Alignments
between causal models

- Abstraction involves associating high-level variables with clusters of low-level variables
- Alignment between low-level and high-level causal models is introduced
- Alignment consists of a partition and a family of maps
- Alignment induces a unique translation
- Translation is a partial function from low-level interventions to high-level interventions
- Low-level interventions that correspond to high-level interventions are defined by cell-wise maps

Causal consistency and constructive abstraction

- Definition 10: An alignment between two models is consistent if the high-level intervention corresponding to a low-level intervention results in the same high-level total settings....
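
The definitions summarized above (causal model, solution, intervention, and the tree-structured algorithm for hierarchical equality) can be illustrated with a short Python sketch. This is not the paper's implementation. The variable names X1..X4, Y1, Y2, O and the helpers `make_algorithm`, `intervene`, `solve`, `is_solution`, and `interchange` are assumptions made for illustration, and the fixed-point loop in `solve` only works for acyclic models.

```python
from typing import Callable, Dict, List

Setting = Dict[str, object]                     # a (total or partial) setting
Model = Dict[str, Callable[[Setting], object]]  # one structural function per variable


def make_algorithm(x1: str, x2: str, x3: str, x4: str) -> Model:
    """Hypothetical encoding of the tree-structured algorithm for one input."""
    return {
        # Input variables have constant structural functions.
        "X1": lambda v: x1,
        "X2": lambda v: x2,
        "X3": lambda v: x3,
        "X4": lambda v: x4,
        # Intermediate variables test equality within each pair.
        "Y1": lambda v: v["X1"] == v["X2"],
        "Y2": lambda v: v["X3"] == v["X4"],
        # Output is True when both pairs are equal or both are unequal.
        "O": lambda v: v["Y1"] == v["Y2"],
    }


def intervene(model: Model, i: Setting) -> Model:
    """Return M_i: f_X becomes the constant function v -> i[X] for each X in i."""
    new_model = dict(model)
    for name, value in i.items():
        new_model[name] = lambda v, value=value: value
    return new_model


def solve(model: Model) -> Setting:
    """Solve an acyclic model by simultaneous updates until values settle."""
    v: Setting = {name: None for name in model}
    for _ in range(len(model)):  # n rounds suffice for n acyclic variables
        v = {name: f(v) for name, f in model.items()}
    return v


def is_solution(model: Model, v: Setting) -> bool:
    """Check that every variable's value equals its structural function at v."""
    return all(f(v) == v[name] for name, f in model.items())


def interchange(base: Model, source: Model, variables: List[str]) -> Setting:
    """Fix `variables` in the base run to the values they take in the source run."""
    src = solve(source)
    return solve(intervene(base, {x: src[x] for x in variables}))


base = make_algorithm("triangle", "triangle", "square", "pentagon")
source = make_algorithm("square", "pentagon", "triangle", "triangle")

print(solve(base)["O"])                      # False: one equal pair, one unequal pair
print(interchange(base, source, ["Y1"])["O"])  # True: Y1 is now False, matching Y2
```

An interchange intervention here is just an ordinary intervention whose values are read off a second (source) run, which is why `interchange` is a thin wrapper around `intervene` and `solve`.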