arxiv-summary: AI-summarized AI papers

Silences, Spikes and Bursts: Three-Part Knot of the Neural Code

Link to paper: The full paper is available here. You can also find the paper on PapersWithCode here.

Abstract: Neurons can emit action potentials in different patterns. Electrophysiologists labeled action potentials emitted at a high frequency as “bursts”. The burst coding hypothesis suggests that the neural code has three syllables: silences, spikes and bursts. Evidence is reviewed to support the ternary code in terms of mechanisms for burst generation, synaptic transmission and synaptic plasticity....

Universal Guidance for Diffusion Models

Abstract: The proposed algorithm enables diffusion models to be controlled by arbitrary guidance modalities without retraining. The algorithm successfully generates quality images with guidance functions including segmentation, face recognition, object detection, and classifier signals.

Paper Content:

Introduction: Diffusion models are powerful tools for creating digital art and graphics. Most models are controlled through conditioning. Guidance is a more flexible approach to controlling model outputs....

Statistically Optimal Force Aggregation for Coarse-Graining Molecular Dynamics

Abstract: Machine-learned coarse-grained (CG) models can simulate large molecular complexes. Training accurate CG models is a challenge. Commonly used mapping methods are inefficient and incorrect. Optimized force maps can lead to improved CG force-fields.

Paper Content:

Introduction: Current simulations are limited by computational cost. Coarse-graining is used to reduce the computational burden. Finding a force-field that accurately represents physical interactions is a challenge....

Do Deep Learning Methods Really Perform Better in Molecular Conformation Generation?

Abstract: Molecular conformation generation (MCG) is an important problem in drug discovery. Traditional methods have been developed to solve MCG, such as systematic searching, model-building, and random searching. Recently, deep-learning-based MCG methods have been developed. A simple, cheap, parameter-free algorithm based on traditional methods is comparable to, or even outperforms, deep-learning-based MCG methods. Code for the proposed algorithm is available online.

Paper Content:

Introduction: Molecular conformation generation is important for drug discovery and is related to many drug design tasks. Traditional MCG uses conformational search and energy minimization. RDKit is a popular cheminformatics software package. Distance geometry and direct coordinate methods are used. Diffusion models are also used. Deep learning models are evaluated with the Coverage and Matching metrics. A simple algorithm based on traditional approaches outperforms deep learning models.

Related work: The section covers classical methods in computational chemistry, the development of deep learning, and data-driven solutions proposed by researchers.

Classical methods: The traditional MCG paradigm involves conformational search, energy minimization, and energy evaluation. Conformational search suffers from combinatorial explosion. Popular conformational search methods include systematic search, random search, model-building, distance geometry, and molecular dynamics. Energy evaluation methods include force-field and electronic-structure methods. Force-field methods are less accurate than electronic-structure methods, but are faster.

Deep learning methods: Deep learning methods outperform traditional methods on the GEOM benchmark. Earlier work used a VAE to generate atomic coordinates directly, but it could not maintain translation and rotation equivariance. Later works use intermediate structures such as interatomic distances or torsion angles to generate conformations. Diffusion models have been applied to the conformation generation task.

Method: The proposed method is based on RDKit with clustering post-processing. Three samplers are used to generate diverse and low-energy conformations. An unsupervised clustering algorithm selects conformations with consideration of both diversity and energy. Conformations are sampled with uniform, geometric, and energy samplers in the ratio 1:1:4.

Experiment: The datasets and setup used for benchmarking are described. Ten competitive baselines are compared. Results show that the method outperforms most baselines. An ablation study demonstrates that more diverse conformations can easily achieve better results. Benchmarking should be done according to the requirements of downstream applications.

Conclusion: The algorithm outperforms deep learning models. The authors suggest that the community rethink benchmarking in MCG. Deep learning can help build effective MCG models. An RDKit + Clustering algorithm is proposed. Performance is reported on GEOM-QM9 and GEOM-Drugs, with ablation studies on the number of samples and the sampler type on GEOM-QM9.
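The Coverage and Matching metrics named in this summary are not defined here. The sketch below assumes the recall-style definitions commonly used with the GEOM benchmark: Coverage is the fraction of reference conformers that have at least one generated conformer within an RMSD threshold, and Matching is the mean, over reference conformers, of the smallest RMSD to any generated conformer. The function name `coverage_and_matching`, the precomputed RMSD matrix, and the threshold value are illustrative assumptions, not the paper's code.

```python
from typing import List, Tuple


def coverage_and_matching(
    rmsd: List[List[float]], threshold: float
) -> Tuple[float, float]:
    """Compute (Coverage, Matching) from a precomputed RMSD matrix.

    rmsd[i][j] is the RMSD between reference conformer i and generated
    conformer j. The "<=" comparison against the threshold is an assumption;
    some write-ups use a strict inequality.
    """
    if not rmsd or any(len(row) == 0 for row in rmsd):
        raise ValueError("need at least one reference and one generated conformer")

    # Best (smallest) RMSD achieved for each reference conformer.
    best_per_reference = [min(row) for row in rmsd]

    # Coverage: share of references matched within the threshold.
    covered = sum(1 for d in best_per_reference if d <= threshold)
    coverage = covered / len(best_per_reference)

    # Matching: average of the best RMSD per reference (lower is better).
    matching = sum(best_per_reference) / len(best_per_reference)
    return coverage, matching


if __name__ == "__main__":
    # Hypothetical RMSD values: 3 reference conformers, 2 generated ones.
    example = [[0.2, 0.9], [0.8, 0.6], [1.5, 1.1]]
    cov, mat = coverage_and_matching(example, threshold=0.7)
    print(f"coverage={cov:.3f} matching={mat:.3f}")
```

Under these definitions, Coverage rewards any generated conformer that lands near a reference, which fits the summary's remark that more diverse conformations can easily achieve better results.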

AdapterSoup: Weight Averaging to Improve Generalization of Pretrained Language Models

Abstract: Pretrained language models (PLMs) need to specialize to specific domains. Training an adapter for each domain can be impractical. AdapterSoup uses weight-space averaging of adapters trained on different domains. AdapterSoup improves performance on new domains without extra training. Weight averaging of adapters trained on the same domain preserves performance on new domains.

Paper Content:

Introduction: Large language models are pre-trained on large amounts of data in a self-supervised way. To adapt them to a new domain, continued training on in-domain data is helpful. Efficient methods have been proposed to avoid fine-tuning all parameters. Weight-space averaging can be used to improve performance on novel domains without extra training. AdapterSoup ensembles adapters in the weight space to improve performance on novel domains. Text clustering is used to select which adapters to use for each novel domain. Weight-space averaging of PLMs adapted to the same domain with varied hyper-parameters can be used to obtain competitive in-domain scores and preserve the generalization ability of a PLM.

Proposed approach: A PLM is adapted to k domains. The goal is a model that performs well in a novel domain without training more parameters. The provenance of a text is used as a proxy for its textual domain. Assume a PLM fine-tuned on a single domain. Fine-tuned models are combined to obtain good in-domain performance and preserve generalization ability.

Cross-domain AdapterSoup: An illustration of the cross-domain AdapterSoup is provided in Figure 1....
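The weight-space averaging described above can be sketched as follows. This is a minimal illustration, assuming uniform averaging and adapters stored as plain dictionaries of flattened weights. The name `average_adapters` and the toy parameter names are hypothetical, and the text-clustering step that selects which adapters to average is not shown.

```python
from typing import Dict, List, Sequence

# Assumed representation: parameter name -> flattened list of weights.
Adapter = Dict[str, List[float]]


def average_adapters(adapters: Sequence[Adapter]) -> Adapter:
    """Uniformly average several adapters, parameter by parameter."""
    if not adapters:
        raise ValueError("need at least one adapter")

    reference = adapters[0]
    # All adapters must share the same parameter names and sizes.
    for adapter in adapters[1:]:
        if adapter.keys() != reference.keys():
            raise ValueError("adapters must share the same parameter names")
        for name in reference:
            if len(adapter[name]) != len(reference[name]):
                raise ValueError(f"size mismatch for parameter {name!r}")

    count = len(adapters)
    # Element-wise mean across adapters for every parameter.
    return {
        name: [sum(values) / count for values in zip(*(a[name] for a in adapters))]
        for name in reference
    }


if __name__ == "__main__":
    # Two hypothetical adapters trained on different domains.
    adapter_a = {"down": [1.0, 3.0], "up": [0.0, 2.0]}
    adapter_b = {"down": [3.0, 5.0], "up": [2.0, 4.0]}
    print(average_adapters([adapter_a, adapter_b]))
```

Averaging produces a single adapter of the same size as each input, so inference cost matches that of one adapter, in contrast to ensembling the outputs of several models.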

A modern look at the relationship between sharpness and generalization

Abstract: Sharpness of minima can correlate with generalization in deep networks. Reparametrization-invariant sharpness definitions have been proposed. Does sharpness really capture generalization in modern practical settings? The paper observes that sharpness does not correlate well with generalization and reports a negative correlation of sharpness with out-of-distribution error. The right sharpness measure is highly data-dependent.

Paper Content:

Introduction: Sharpness of the training objective has intuitive appeal and appears in generalization bounds. Sharpness can correlate well with generalization in deep learning setups. Training methods that minimize sharpness have had empirical success. Many works suggest that flatter minima should generalize better. Standard sharpness definitions do not correlate well with generalization. Different sharpness definitions can capture different trends. The right sharpness measure is highly data-dependent.

Related work: Sharpness of minima is correlated with the performance degradation of large-batch SGD. Different generalization measures may explain generalization for deep networks. A strong correlation between sharpness and generalization has been reported on a large set of CIFAR-10/SVHN models. Reparametrization-invariant sharpness definitions exist. Flat minima can be beneficial for generalization. Different criteria optimize for more robust minima. The maximum eigenvalue and the trace of the Hessian are the focus of many works. The paper focuses on sharpness-related metrics to better understand generalization for deep networks.

Background on sharpness: The loss on a set of training points is denoted $L_S(w)$. Average-case and worst-case $m$-sharpness are defined. Worst-case sharpness is correlated with generalization. Adaptive sharpness is invariant under multiplicative reparametrizations. Analytical expressions of standard sharpness for radius $\rho \to 0$ depend on first- or second-order terms. Strong hypothesis: sharpness is highly correlated with generalization. Weak hypothesis: models with the lowest sharpness generalize well. The Kendall rank correlation coefficient is used to detect correlation. Adaptive sharpness is invariant for ResNets and ViTs. The scale-sensitivity of classification losses is discussed. Normalization of logits is proposed to fix the scaling issue. How to compute worst-case sharpness efficiently?...
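The Kendall rank correlation coefficient mentioned above can be computed directly from pairwise comparisons. The sketch below implements the simple tau-a variant (tied pairs count as neither concordant nor discordant). The function name `kendall_tau` and the sharpness and error values are hypothetical, and the paper may use a different variant.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if len(x) < 2:
        raise ValueError("need at least two observations")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(x)), 2):
        # The sign of the product tells whether the pair is ordered the
        # same way in both sequences.
        direction = (x[i] - x[j]) * (y[i] - y[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1

    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


if __name__ == "__main__":
    # Hypothetical per-model measurements, one entry per trained model.
    sharpness = [0.1, 0.4, 0.2, 0.9]
    test_error = [1.0, 2.0, 2.5, 4.0]
    print(kendall_tau(sharpness, test_error))
```

A value near 1 would support the strong hypothesis stated above, a value near 0 indicates no rank relationship, and a negative value means sharper models rank as generalizing better.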

A Review of the Role of Causality in Developing Trustworthy AI Systems

Abstract: AI models lack understanding of cause-effect relationships in the real world. AI models do not generalize to unseen data, produce unfair results, and are difficult to interpret. Causal modeling and inference methods have been developed to improve the trustworthiness of AI models.

Paper Content:

Causality and robustness: Pre-processing methods use causality to create data augmentations. Adversarial examples are artificially perturbed input values that can fool machine learning models. Data augmentation methods use causal graphs to motivate data augmentation. Problem abstraction methods simplify the problem for machine learning agents. In-processing methods use causality-aware optimization objectives or architecture design choices. Post-processing methods alter predictions or enable causality-informed model selection. Causal models can help prevent attacks that expose users’ private information.

Causality and privacy: Pre-processing uses data augmentation to reduce heterogeneity across user data distributions. In-processing uses invariant risk minimization to defend against membership inference and property inference attacks. Post-processing uses test-data-specific normalization to improve generalization and provide better privacy guarantees. The success of a membership inference attack is measured in terms of accuracy and advantage. Causal models are used to improve privacy by reducing overfitting and providing better differential privacy guarantees. Ex-ante impact assessment identifies potential negative effects before deployment. Ex-post impact assessment identifies potential negative effects after deployment.

Ex-ante impact assessment: Ex-ante impact assessment predicts the risks and impacts of proposed systems. It is used to assess environmental, financial, social, and human rights ramifications. Environmental impact assessment uses deductive and inductive causal inference. Social and fiscal impact assessment uses deductive and inductive methods to discover causal relationships. Economic impact assessment looks at the effects of introducing new economic policies or changing existing ones.

Ex-post impact assessment: Ex-ante impact assessments are limited and often cannot identify all risks and impacts. Ex-post impact assessment is used to detect risks and impacts on the go. Ex-ante assessments have clear guidelines and metrics for specific types of impacts. Ex-post assessments are broader and need to define what constitutes an impact in real time. Causal inference is used to tackle various categories of risks and impacts. Temporal and long-term effects can be seen in real-world systems. Causality can help assess systems in real time and identify the elements responsible. Causality can help identify the root cause of system failure and system misuse. Strategic risks and effects can be analyzed with causality. Causal methods have been demonstrated to be advantageous in healthcare.

Causality in healthcare through the SCM framework: Structural causal models (SCMs) are used in healthcare and personalized medicine. Causality is used to explain the outcomes of medical models, to discover causal relationships in medical imaging, to repurpose drugs for new diseases, to identify causal factors of clinical conditions, and to make AI algorithms fair and robust.

Causality in healthcare through the PO framework: The potential outcomes (PO) framework is commonly used in the medical field. It provides methods to conduct causal analysis from a statistical perspective and removes selection bias in historical data. Shi and Norgeot review different research works that estimate treatment effects. The framework is used to test whether a drug is beneficial or harmful. Graham et al....

Concentration Bounds for Discrete Distribution Estimation in KL Divergence

Abstract: The goal is to estimate a discrete distribution in KL divergence. Concentration bounds are given for the Laplace estimator. The deviation from the mean scales as $\sqrt{k}/n$ when $n \ge k$. A matching lower bound is established, tight up to polylogarithmic factors.

Paper Content:

Introduction: Discrete distribution estimation is a fundamental problem in statistics. The goal is to estimate an arbitrary discrete distribution in Kullback-Leibler (KL) divergence with vanishing probability of error. KL divergence is non-negative, unbounded and asymmetric. The maximum likelihood estimator (empirical estimator) is commonly used. Minimax rates and concentration bounds are typically studied. The Laplace estimator is commonly used, with a known rate of convergence. Concentration bounds of the form “with probability $1 - \delta$” are desired. McDiarmid’s inequality can be used for the $\ell_1$ distance. The best known bound for KL divergence is given by [4]. The main result is a bound on the minimax rate of the Laplace estimator. A lower bound on the variance of $\mathrm{KL}(p \,\|\, \hat{p})$ is established, where $\hat{p}$ denotes the Laplace estimator. A direct consequence of the main result is improved sample complexity for estimating a tree-structured Bayesian network.

Analysis sketch: McDiarmid’s inequality is a standard way to provide concentration bounds for discrete distribution estimators. Lemma 1 (McDiarmid’s inequality) states that if changing the $i$-th variable changes the value of the function by at most $c_i$ in absolute value, then the function concentrates around its mean with high probability. The goal is to obtain a good enough bound on $c_\infty(f)$ to show that $n\,c_\infty(f)^2$ decreases to 0. [6] observes that the $\ell_1$ distance between the true distribution and the empirical distribution satisfies $c_\infty \le 2/n$. A direct application of McDiarmid’s inequality to $\mathrm{KL}(p \,\|\, \hat{p})$ results in a vacuous bound. To provide a stronger bound, the KL divergence is written as a function of the $k$ counts $N_1, N_2, \ldots, N_k$. The counts $N_i$ are not independent of each other, so the Poisson sampling process is used. A concentration bound for KL under Poisson sampling yields a concentration bound for KL under multinomial sampling. A high-probability bound on $c_\infty(\mathrm{KL})$ is provided. Lemma 2 states that the expectations under multinomial and Poisson sampling are similar. A careful coupling between binomial and Poisson random variables with the same mean is used to obtain the needed bounds.

Analysis: Combining the equations yields the result needed for the KL divergence bound. Lemma 4 is stated in terms of $\gamma = 311/n + 160k/n^{3/2}$. Both $\sum_i p_i = 1$ and $\sum_i \hat{p}_i = 1$. $\mathrm{KL}(p \,\|\, \hat{p})$ is bounded with high probability. Jensen’s inequality is used to bound the right-hand side of the equation. Lemma 3 and a union bound are used to bound each term. McDiarmid’s inequality is applied to the function with parameter $\delta/2$. Theorem 3 provides lower bounds on the variance of the KL divergence of the Laplace estimator. Corollary 1 shows that the $\sqrt{k}$ dependence is tight. The argument in Theorem 3 can be extended to a further regime of $n$ relative to $k$.

Conclusion: The KL divergence between the underlying distribution and the Laplace estimator can be bounded. The previous bound of $\tilde{O}(k \log(1/\delta)/n)$ is improved to $\tilde{O}(\sqrt{k}\,\log^{5/2}(1/\delta)/n)$. A lower bound of $\Omega(\sqrt{k}/n)$ on the variance and tail bound of the KL loss of the Laplace estimator is established. A heuristic computation of the leading constant is done by asymptotic expansion of the KL divergence. As $n \to \infty$, $\hat{p}_i \to p_i$, and the Laplace and empirical estimators are similar. A chi-squared distribution is used to approximate $\sum_{i=1}^{k} (p^{\mathrm{emp}}_i - 1/k)^2$. Poisson and binomial random variables are used to define a convenient coupling. A standard fact is used to upper bound the equation. Figure 1 shows sample standard deviation vs....
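To make the estimator and the loss concrete, here is a minimal sketch. It assumes the Laplace estimator is the usual add-one estimator, $(N_i + 1)/(n + k)$ for counts $N_i$ summing to $n$ over $k$ symbols. The function names and the example distribution are illustrative and do not come from the paper.

```python
import math
from typing import List, Sequence


def laplace_estimator(counts: Sequence[int]) -> List[float]:
    """Add-one estimate: (N_i + 1) / (n + k), assumed form of the Laplace estimator."""
    if not counts:
        raise ValueError("need at least one symbol")
    n = sum(counts)
    k = len(counts)
    return [(c + 1) / (n + k) for c in counts]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; infinite when q gives zero mass to a symbol p can emit."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same support size")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue  # terms with p_i = 0 contribute nothing
        if q_i == 0:
            return math.inf
        total += p_i * math.log(p_i / q_i)
    return total


if __name__ == "__main__":
    true_p = [0.5, 0.25, 0.25]   # hypothetical true distribution
    counts = [3, 0, 1]           # hypothetical sample with n = 4, k = 3
    n = sum(counts)
    empirical = [c / n for c in counts]
    print(kl_divergence(true_p, laplace_estimator(counts)))
    print(kl_divergence(true_p, empirical))
```

In this toy sample the empirical estimator assigns zero mass to an unseen symbol, so its KL loss is infinite, while the add-one estimate stays strictly positive and keeps the loss finite. This reflects the unboundedness of KL divergence noted in the summary.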

SwitchPrompt: Learning Domain-Specific Gated Soft Prompts for Classification in Low-Resource Domains

Abstract: Pre-trained language models are effective for natural language processing tasks, but not for low-resource domains due to the domain gap. SwitchPrompt is a novel and lightweight prompting methodology to bridge the domain gap. SwitchPrompt uses domain-specific keywords with a trainable gated prompt to offer domain-oriented prompting. Few-shot experiments on three text classification benchmarks demonstrate the efficacy of the general-domain pre-trained language models when used with SwitchPrompt....

EspalomaCharge: Machine learning-enabled ultra-fast partial charge assignment

Abstract: Atomic partial charges are important for molecular dynamics simulations. Traditionally, partial charges are assigned using quantum chemical methods. A hybrid physical/graph neural network-based approach is proposed to approximate the widely popular AM1-BCC charge model. This hybrid approach is orders of magnitude faster and maintains accuracy comparable to differences in AM1-BCC implementations. The hybrid approach scales linearly with the number of atoms....