Link to paper The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract Mining large corpora is time-consuming for humans Formulated a new task, D5, to automatically discover differences between two large corpora Task input is a research goal and a corpus pair Output is a language description of how the corpora differ Built a D5 system and contributed a meta-dataset and proposed unified evaluation metrics Confirmed language models can use goals to propose relevant, novel, and significant discoveries System produces discoveries previously unknown to the authors on a wide range of applications Paper Content Introduction Processes of generating discoveries from large corpora are ad hoc and laborious Machine learning can potentially accelerate these discovery processes Formulated one family of these processes as an ML task with unified metrics and input-output space Task is goal driven discovery of differences between text distributions via language descriptions Input is a “problem” comprising a description of the research goal and a corpus pair Output is a “discovery” represented as a natural language predicate Evaluate a discovery using two categories of criteria: validity and meaningfulness Curate OPEND5, a meta-dataset with 675 open-ended D5 problems Built a D5 system to tackle problems in OPEND5 System produces valid and meaningful discoveries in natural language as outputs Evaluated system and found it produces relevant hypotheses more often than baseline Automate discoveries, train better D5 systems, and analyze limitations of evaluation OPEND5 allows benchmarking, automation, analysis, and learning of D5 task Evaluation Evaluate system-generated discovery by determining if more samples from Corpus A satisfy the predicate Subjective judgement needed to determine if discovery is meaningful to research goal of understanding side effects Validity Requires output discovery h to be a truth predicate on a text sample Define T (h, x) as certainty that h is true on x Approximate T (h, x) by asking three Turkers and averaging responses Define validity V as mean of T (h, x) on subset from Corpus A and B Compute p-value for null hypothesis that V ≤ 0 by conducting t-test Ideal discovery should have large V value and small p-value Meaningfulness Valid discoveries may not be meaningful Relevance, novelty and significance can be used to rate how meaningful a discovery is Relevance is based on how related the discovery is to the research goal Novelty is based on how difficult it is to generate the discovery Significance is based on how beneficial it is to learn the discovery An ideal discovery should have high ratings for all three submetrics Method System maps from corpus pair and research goal to set of natural language predicates Inspired by two-stage model of how humans discover patterns in data Propose hypotheses conditioned on research goal and subset of samples from corpus pair Validate each hypothesis to see if it is more often true on one corpus than the other Leverage research goal to propose more meaningful hypotheses Hypothesis proposer Prompted GPT-3 to propose hypotheses Included research goal in prompt to elicit meaningful hypotheses Hypothesis validator Hypotheses in H init are often invalid Use language model T to simulate Turkers’ judgement and approximate validity score V of hypothesis h Use FLAN-T5 to ask whether x satisfies h Collect additional Turker annotations to fine-tune FLAN-T5 Perform t-test to compare mean value of V (h, x) on research split of Corpus A and mean value on Corpus B Rule out hypotheses with p-value greater than 0....