arxiv-summary: AI-summarized AI papers

Causal isotonic calibration for heterogeneous treatment effects

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
- Proposed a new method for calibrating predictors of heterogeneous treatment effects
- Introduced a data-efficient variant of calibration that avoids the need for hold-out calibration sets
- Established that the proposed method achieves fast doubly-robust calibration rates
- Wrapping the proposed method around any black-box learning algorithm provides strong calibration guarantees while preserving predictive performance

Paper Content

Introduction
- Estimation of causal effects is important for understanding interventions and informing policy
- Treatment effect heterogeneity can provide more insights than overall population effects
- Applications of heterogeneous treatment effects (HTEs) include prioritizing treatment and individualizing treatment assignments
- Estimation of the conditional average treatment effect (CATE) is of great interest in statistics and data science
- CATE estimators build upon estimators of the conditional mean outcome and the probability of treatment given covariates
- CATE estimation can be challenging due to non-smooth, high-dimensional nuisance parameters
- Predictions from a given treatment effect predictor can still be useful for decision-making
- Theoretical guarantees for rational decision-making typically hinge on the predictor being a good approximation of the true CATE
- Calibration is a desirable property of a treatment effect predictor
- Calibration has been widely used to enhance prediction models for classification and regression
- Little research has been done on calibration of treatment effect predictors
- This paper proposes a nonparametric doubly-robust method for calibrating treatment effect predictors

Statistical setup

Notation and definitions
- Data unit O consists of three components: W, A, and Y
- W is a vector of baseline covariates
- A is a binary indicator of treatment
- Y is an outcome
- Dn is the observed dataset
- π0 is the propensity score
- µ0 is the conditional mean outcome given treatment and covariates
- Higher values of Y1 - Y0 are desirable
- τ0 is the true CATE
- γ0 is the conditional mean of the individual treatment effect given the value of a predictor τ
- The solution to the isotonic regression problem is non-unique; the solution used follows Groeneboom and Lopuhaa (1993)

Measuring calibration and the calibration-distortion decomposition
- Various definitions of risk predictor calibration have been proposed
- The paper outlines its definition of calibration and the rationale for it
- Best predictor of the individual treatment effect is w ↦ γ0(τ, w)
- Perfect calibration cannot be achieved in finite samples
- Calibration measure is the ℓ2-expected calibration error
- The calibration measure plays a role in the mean squared error between the treatment effect predictor and the true CATE
- The calibration-distortion decomposition shows that, at a fixed level of distortion, better-calibrated treatment effect predictors have lower mean squared error

Calibrating predictors: desiderata and classical methods
- Calibration methods aim to find a transformation of a predictor's outputs that makes the predictor well calibrated
- Platt’s scaling is used for binary outcomes and is based on strong parametric assumptions
- Histogram binning partitions the sorted values of the predictor into a fixed number of bins
- Bayesian binning considers multiple binning models and their combinations
- Isotonic calibration learns the bins from data using isotonic regression
- Isotonic calibration satisfies a distribution-free calibration guarantee and is at least as predictive as the original predictor

Causal isotonic calibration
- Inspired by isotonic calibration, a doubly-robust calibration method for treatment effects is proposed, called causal isotonic calibration
- Takes a given predictor trained on some dataset and performs calibration using an independent (or hold-out) dataset
- Automatically learns uncalibrated regions of the given predictor
- Consolidates individual predictions within each region into a single value using a doubly-robust estimator of the ATE
- Introduces a novel data-efficient variant of calibration called cross-calibration
- Cross-fitted predictors are used and a single calibrated predictor is obtained using all available data
- Implemented using standard isotonic regression software
- An estimate of χ0 is obtained using Em
- χ0(O) is referred to as a pseudo-outcome
- Isotonic regression of the estimated pseudo-outcomes on the predictions is used to find θn
- Calibrated predictor is given by θn ∘ τ
- Sample splitting or cross-fitting is recommended to obtain pseudo-outcomes
- Algorithm 2 provides a means to fully utilize the entire dataset for both fitting an initial estimate of τ0 and calibration
- Algorithm 3 is a computationally simpler variant of Algorithm 2

Large-sample theoretical properties
- Algorithm 1 and Algorithm 2 are presented for causal isotonic calibration
- Properties 1 and 2 are argued to be satisfied
- Data is split into a training dataset and a calibration dataset
- Conditions 1-5 are assumed
- Theorem 1 establishes the calibration rate of the calibrated predictor
- Theorem 2 states that the pointwise median preserves calibration
- Theorem 3 states that the mean squared error is not inflated much

Data-generating mechanisms
- Examined behavior of the proposal under two data-generating mechanisms
- Scenario 1: binary outcome, 4 confounders, treatment interactions
- Scenario 2: continuous outcome, linear on covariates, 20 true confounders
- Propensity score follows a logistic regression model
- Covariates are independent and uniformly distributed on (-1, +1)
- Sample sizes of 1,000, 2,000, 5,000 and 10,000

CATE estimation
- Implemented GBRT, RF, GLMnet, GAM, and MARS for Scenario 1
- Implemented RF, GLMnet, and a combination of variable screening with lasso regularization for Scenario 2
- Used R package sl3 for implementation of estimators
- Used causal isotonic cross-calibration for calibration

Performance metrics
- Compared the performance of each predictor with and without causal isotonic calibration
- Used 3 metrics to compare performance: calibration measure, mean squared error, and calibration bias within bins
- Estimated each metric empirically using an independent sample of size 10,000
- Averaged metric estimates across 500 simulations

Simulation results
- GLMnet, RF, GAM, and MARS were well-calibrated and did not benefit from calibration
- GBRT benefited from calibration, reducing calibration error and improving mean squared error
- RF and GBRT with GLMnet screening were poorly calibrated and benefited from calibration
- Cross-calibration improved mean squared error and calibration more than conventional calibration

Conclusion
- Proposed causal isotonic calibration as a novel method to calibrate treatment effect predictors
- Established that the pointwise median of calibrated predictors is also calibrated
- Developed a data-efficient variant of causal isotonic calibration using cross-fitted predictors
- Calibration error vanishes at a fast rate of n^(-2/3) with little or no loss in predictive power
- Directly calibrates HTE predictors without requiring trial data or parametric assumptions
- Potential applications of the method include data-driven decision-making with strong robustness guarantees
- Limitations: need to estimate µ0 or π0 sufficiently well
- Found that calibration generally preserves predictive power, and in some cases improves accuracy
- Found that cross-calibration substantially improved mean squared error
- Theoretical arguments can be adapted to provide guarantees for isotonic calibration in regression and classification problems
- Implementation of algorithms in R provided in GitHub package causalCalibration
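
The hold-out version of causal isotonic calibration summarized above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the authors' causalCalibration code: the function names (`pseudo_outcome`, `isotonic_fit`, `calibrate`) are hypothetical, the nuisance estimates `pi`, `mu1`, and `mu0` are assumed to be supplied from a separate training sample, and the pseudo-outcome is assumed to take the standard doubly-robust (AIPW) form.

```python
from bisect import bisect_right


def pseudo_outcome(y, a, pi, mu1, mu0):
    """Doubly-robust (AIPW-style) pseudo-outcome for one unit.

    Assumed inputs: pi ~ P(A=1 | W), mu1 ~ E[Y | A=1, W], mu0 ~ E[Y | A=0, W].
    """
    mu_a = mu1 if a == 1 else mu0
    return (mu1 - mu0) + (a - pi) / (pi * (1.0 - pi)) * (y - mu_a)


def isotonic_fit(x, z):
    """Pool-adjacent-violators: nondecreasing least-squares fit of z on x.

    Returns (starts, values): the left edge and fitted value of each block.
    """
    sums, counts = {}, {}
    for xi, zi in zip(x, z):  # pool tied x values first
        sums[xi] = sums.get(xi, 0.0) + zi
        counts[xi] = counts.get(xi, 0) + 1
    blocks = []  # each block: [sum of z, count, smallest x in block]
    for xi in sorted(sums):
        blocks.append([sums[xi], counts[xi], xi])
        # merge neighbouring blocks while their means violate monotonicity
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c, _ = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    return [b[2] for b in blocks], [b[0] / b[1] for b in blocks]


def calibrate(predictor, starts, values):
    """Compose the fitted step function with the original predictor."""
    def calibrated(w):
        k = max(bisect_right(starts, predictor(w)) - 1, 0)
        return values[k]
    return calibrated


# Toy hold-out sample: the true CATE is w, but the predictor returns 3 * w.
predictor = lambda w: 3.0 * w
ws = [i / 10 for i in range(11)]
chi = []
for i, w in enumerate(ws):
    a = i % 2                      # alternating treatment, propensity 0.5
    y = w if a == 1 else 0.0       # noise-free outcome for readability
    chi.append(pseudo_outcome(y, a, pi=0.5, mu1=w, mu0=0.0))
starts, values = isotonic_fit([predictor(w) for w in ws], chi)
tau_star = calibrate(predictor, starts, values)
print(tau_star(ws[4]))
```

In this noise-free toy example the isotonic step maps the overconfident prediction 3w back to w. With real data the pseudo-outcomes are noisy, and the pooling step is what merges poorly ordered predictions into regions that share a single doubly-robust effect estimate.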

Optimistic Planning by Regularized Dynamic Programming

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
- Proposed a new method for optimistic planning in infinite-horizon discounted Markov decision processes (MDPs)
- The method adds regularization to the updates of an approximate value iteration procedure
- This allows the use of approximate transition functions estimated via least-squares procedures in MDPs with linear function approximation
- Provides a computationally efficient algorithm for learning near-optimal policies in discounted linear kernel MDPs from a single stream of experience
- Achieves near-optimal statistical guarantees

Paper Content

Introduction
- The idea of constructing a confidence set of statistically plausible models and picking a policy that maximizes the expected return can be traced back to Lai & Robbins (1985)....
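
To make "regularized value iteration" concrete, here is a minimal tabular sketch. It is not the paper's algorithm, which works with linear function approximation and least-squares model estimates. It swaps in a familiar stand-in: entropy-regularized (log-sum-exp) Bellman backups plus an additive optimism bonus. The names `soft_value_iteration`, `bonus`, and `eta` are assumptions made for illustration.

```python
import math


def soft_value_iteration(P, r, bonus, gamma=0.9, eta=5.0, iters=200):
    """Entropy-regularized value iteration on a small tabular MDP.

    P[s][a] is a list of next-state probabilities, r[s][a] a reward and
    bonus[s][a] an optimism bonus (all hypothetical inputs). Larger eta
    means weaker regularization, i.e. closer to the hard max.
    """
    n_states = len(P)
    V = [0.0] * n_states
    Q = []
    for _ in range(iters):
        # optimistic one-step lookahead
        Q = [
            [
                r[s][a] + bonus[s][a]
                + gamma * sum(p * v for p, v in zip(P[s][a], V))
                for a in range(len(P[s]))
            ]
            for s in range(n_states)
        ]
        # soft maximum (1/eta) * log sum_a exp(eta * Q), computed stably
        V = []
        for q in Q:
            m = max(q)
            V.append(m + math.log(sum(math.exp(eta * (x - m)) for x in q)) / eta)
    # softmax policy induced by the final Q-values
    policy = []
    for q in Q:
        m = max(q)
        weights = [math.exp(eta * (x - m)) for x in q]
        total = sum(weights)
        policy.append([w / total for w in weights])
    return V, policy


# One state, two actions: action 0 pays 1, action 1 pays 0.
P = [[[1.0], [1.0]]]
r = [[1.0, 0.0]]
no_bonus = [[0.0, 0.0]]
V, policy = soft_value_iteration(P, r, no_bonus, gamma=0.5, eta=50.0)
print(V, policy)
```

The log-sum-exp backup is still a gamma-contraction, so the iteration converges as ordinary value iteration does; the regularization keeps the induced policy smooth in the Q-values.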

LLaMA: Open and Efficient Foundation Language Models

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
- Introduces LLaMA, a collection of foundation language models ranging from 7B to 65B parameters
- Trained on trillions of tokens using publicly available datasets
- LLaMA-13B outperforms GPT-3 (175B) on most benchmarks
- LLaMA-65B is competitive with the best models
- All models are released to the research community

Paper Content

Introduction
- LLMs trained on large corpora of texts can perform new tasks from instructions or examples
- Scaling models to a sufficient size results in few-shot properties
- More parameters do not always lead to better performance
- Objective of scaling laws is to determine how to best scale dataset and model sizes for a particular training compute budget
- Focus of this work is to train a series of language models that achieve the best possible performance at various inference budgets
- Models range from 7B to 65B parameters with competitive performance compared to the best existing LLMs
- Training approach is similar to methods described in previous work and is inspired by Chinchilla scaling laws

Pre-training data
- Training dataset is a mixture of several sources
- Data sources are publicly available and compatible with open sourcing
- 67% of the dataset is English CommonCrawl
- Preprocessed with CCNet pipeline 4....

Inseq: An Interpretability Toolkit for Sequence Generation Models

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
- Past work in natural language processing interpretability focused mainly on classification tasks and overlooked generation settings.
- Inseq is a Python library to make interpretability analyses of sequence generation models easier.
- Inseq can be used to highlight gender biases in machine translation models and locate factual knowledge inside GPT-2.
- Inseq’s extensible interface supports cutting-edge techniques and can drive advances in explainable natural language generation....

ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
- Proposed a learning-based encoder for fast and accurate concept customization
- Consists of global and local mapping networks
- Global mapping network projects hierarchical features of the image into multiple “new” words in the textual word embedding space
- Local mapping network injects encoded patch features into cross-attention layers to provide omitted details
- Compares the method with prior optimization-based approaches on user-defined concepts
- Demonstrates a faster encoding process with higher-fidelity inversion and more robust editability

Paper Content

Introduction
- Large-scale diffusion models demonstrate impressive superiority in text-to-image generation
- Applied to various tasks such as image editing, data augmentation, and artistic creation
- Customized text-to-image generation aims to learn a specific concept from a small set of user-provided images
- Existing methods usually adopt the per-concept optimization formulation, which requires several or tens of minutes to learn a single concept
- Proposed a learning-based encoder to encode visual concepts into textual embeddings
- Global mapping network maps CLIP image features into the textual word embedding space
- Local mapping network encodes CLIP features into the textual feature space
- Experiments demonstrate that the method can encode the target concept efficiently and faithfully

Related work

Text-to-image generation
- Deep generative models have been successful in text-conditioned image generation
- Models can be categorized into three groups: GAN-based, VAE-based, and diffusion-based
- Diffusion-based models demonstrate high-quality and controllable image generation
- GLIDE introduces diffusion models into text-to-image generation
- Diffusion models struggle to express specific or user-defined concepts

GAN inversion
- GAN inversion is a way to project real images into latent codes
- There are two types of GAN inversion algorithms: optimization-based and encoder-based
- Optimization-based methods require many iterations, while encoder-based methods only require one feed-forward pass
- ELITE proposes a local mapping network to improve detail consistency

Diffusion-based inversion
- Text-to-image diffusion models can be inverted in two types of latent spaces: Textual Word Embedding (TWE) space and Image-based Noise Map (INM) space....

The Role of Pre-training Data in Transfer Learning

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
- Transfer learning produces high-accuracy models
- Pre-training dataset size affects transfer learning performance
- Choice of pre-training data source is essential for few-shot transfer
- There are trade-offs between label noise and the size of the pre-training dataset
- Language-image contrastive and image-image contrastive pre-training methods have different effects on downstream accuracy

Paper Content

Introduction
- Transfer learning is a popular way to produce computer vision models
- Pre-trained models have improved in recent years
- Research question: How do pre-training dataset and algorithm affect downstream performance?...

OccDepth: A Depth-Aware Method for 3D Semantic Scene Completion

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
- 3D Semantic Scene Completion (SSC) can provide dense geometric and semantic scene representations for autonomous driving and robotic systems.
- Depth information is necessary for 3D geometry restoration.
- A stereo SSC method named OccDepth is proposed to exploit implicit depth information from stereo images or RGBD images.
- A reformed TartanAir benchmark, named SemanticTartanAir, is provided for testing OccDepth on the SSC task....

Navigating the Grey Area: Expressions of Overconfidence and Uncertainty in Language Models

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
- Language models lack the ability to interpret and generate expressions of uncertainty.
- Expressions of uncertainty are important for human decision-making.
- GPT-3’s accuracy varies depending on the expression of uncertainty used.
- Models struggle to emit trustworthy expressions of uncertainty.

Paper Content

Introduction
- Natural language systems need to communicate uncertainties
- Expressions of uncertainty inform decision-making
- Prior work has focused on mapping internal probabilities to verbal/numerical output
- Our work focuses on non-unidimensional linguistic features (hedges, epistemic markers, etc....

Jointly Optimizing Translations and Speech Timing to Improve Isochrony in Automatic Dubbing

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
- Automatic dubbing is the task of translating original speech in a video into a target language.
- The target language speech should match the timing of the original video, including mouth movements, pauses, and hand gestures.
- This paper proposes a model that optimizes both the translation and the speech duration of the generated translations....

SplineCam: Exact Visualization and Characterization of Deep Network Geometry and Decision Boundaries

Link to paper
The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract
- Current Deep Network (DN) visualization and interpretability methods rely on data space visualizations.
- SplineCam is the first provably exact method for computing the geometry of a DN’s mapping.
- SplineCam applies to any DN architecture based on Continuous Piece-Wise Linear (CPWL) nonlinearities.
- SplineCam enables comparison of architectures, measuring generalizability and sampling from the decision boundary.

Paper Content

Introduction
- Deep learning and in particular Deep Networks (DNs) have redefined machine learning and pattern recognition
- DNs employ a variety of techniques to improve performance
- DNs consist of sequentially mapping an input vector to a sequence of feature maps
- Weight matrix, bias vector and activation operator control the type of layer
- Rectified Linear Unit (ReLU) is a popular choice for activation operator
- Interpreting the geometry of a DN is a nontrivial task
- Activation-based interpretability methods can be susceptible to feature adversarial attacks
- Finding the closest point to a training sample that lies on the model’s decision boundary is an empirical method for model interpretation
- Continuous Piece-Wise Linear (CPWL) activation functions are used in DNs
- SplineCam is a sampling-free method to compute the exact partition of a DN
- SplineCam can visualize a DN’s input space partition, compute partition statistics and sample from the decision boundary

The exact geometry and decision boundary of continuous piece-wise linear deep networks

Deep networks as continuous piece-wise linear operators
- Spline operators are a form of nonlinear function
- On each region of the input space partition, a spline is a polynomial of degree P
- The first P-1 derivatives of the polynomials are continuous
- DNs with continuous piecewise affine (CPA) activations can be expressed as splines
- Spline theory has been used in approximation theory, optimal control, and statistics

Exact computation of their partition and decision boundary
- Suppose w and b are rows of W and b....
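
The statement that a DN with piecewise-linear activations acts as a spline can be checked numerically on a toy network. The sketch below is not SplineCam itself and does not enumerate the partition. For a single hidden ReLU layer it reads off the activation pattern at an input x and builds the affine map (A, c) that the network equals on the whole region containing x. The weights and the helper names `relu_net` and `local_affine_map` are made up for illustration.

```python
def matvec(M, v):
    """Plain-Python matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def relu_net(x, W1, b1, W2, b2):
    """Forward pass of a one-hidden-layer ReLU network."""
    hidden = [max(0.0, z + b) for z, b in zip(matvec(W1, x), b1)]
    return [z + b for z, b in zip(matvec(W2, hidden), b2)]


def local_affine_map(x, W1, b1, W2, b2):
    """Return (pattern, A, c) with relu_net(x') == A x' + c on x's region.

    pattern[k] is 1.0 where hidden unit k is active at x, else 0.0, so
    A = W2 diag(pattern) W1 and c = W2 diag(pattern) b1 + b2.
    """
    pre = [z + b for z, b in zip(matvec(W1, x), b1)]
    pattern = [1.0 if p > 0 else 0.0 for p in pre]
    units = range(len(pattern))
    A = [
        [sum(W2[i][k] * pattern[k] * W1[k][j] for k in units)
         for j in range(len(x))]
        for i in range(len(W2))
    ]
    c = [
        sum(W2[i][k] * pattern[k] * b1[k] for k in units) + b2[i]
        for i in range(len(W2))
    ]
    return pattern, A, c


# Hypothetical weights: 2 inputs, 3 hidden units, 1 output.
W1 = [[1.0, -1.0], [0.5, 2.0], [-1.0, 0.0]]
b1 = [0.0, -1.0, 0.5]
W2 = [[1.0, -2.0, 0.5]]
b2 = [0.1]

x = [1.0, 0.5]
pattern, A, c = local_affine_map(x, W1, b1, W2, b2)
affine_out = [z + ci for z, ci in zip(matvec(A, x), c)]
print(pattern, relu_net(x, W1, b1, W2, b2), affine_out)
```

Every input that shares the same activation pattern lies in the same region of the input space partition and is mapped by the same (A, c); the region boundaries are the hyperplanes where a pre-activation crosses zero.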