Link to paper The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract Need for large-scale high-quality text datasets BigScience workshop formed to research and train large language models ROOTS corpus created, 1.6TB dataset spanning 59 languages BLOOM language model trained using ROOTS corpus Large initial subset of corpus released with processing tools Paper Content Introduction BigScience1 is a one-year open collaborative research initiative Goal was to train an open-access, massively multilingual language model Engaged in ethical, sociopolitical, and data governance issues Four working groups: Data Governance, Data Sourcing and Preparation, Privacy, Legal Scholarship Released a large subset of ROOTS Released data tools used to curate, source, clean and inspect constituent datasets Outline of the paper Collected a web-scale dataset covering 59 languages 46 natural languages and 13 programming languages 62% of text from community-selected and documented list of language data sources 38% of text from pre-processed web crawl, OSCAR Filtered with help of native speakers Related work Pre-trained models are used in natural language processing Performance is based on model size and dataset size/quality Recent models trained on up to 1....