Authors:
(1) Thuat Nguyen, Dept. of Computer Science, University of Oregon, OR, USA;
(2) Chien Van Nguyen, Dept. of Computer Science, University of Oregon, OR, USA;
(3) Viet Dac Lai, Dept. of Computer Science, University of Oregon, OR, USA;
(4) Hieu Man, Dept. of Computer Science, University of Oregon, OR, USA;
(5) Nghia Trung Ngo, Dept. of Computer Science, University of Oregon, OR, USA;
(6) Franck Dernoncourt, Adobe Research, USA;
(7) Ryan A. Rossi, Adobe Research, USA;
(8) Thien Huu Nguyen, Dept. of Computer Science, University of Oregon, OR, USA.
Abstract
The driving factors behind the development of large language models (LLMs) with impressive learning capabilities are their colossal model sizes and extensive training datasets. Along with the progress in natural language processing, LLMs have frequently been made accessible to the public to foster deeper investigation and applications. However, the training datasets for these LLMs, especially the recent state-of-the-art models, are often not fully disclosed. Creating training data for high-performing LLMs involves extensive cleaning and deduplication to ensure the necessary level of quality. This lack of transparency around training data has hampered research on attributing and addressing hallucination and bias issues in LLMs, hindering replication efforts and further advancements in the community. These challenges become even more pronounced in multilingual learning scenarios, where the available multilingual text datasets are often inadequately collected and cleaned. Consequently, there is a lack of open-source and readily usable datasets to effectively train LLMs in multiple languages. To overcome this issue, we present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages, tailored for LLM development. Our dataset undergoes meticulous cleaning and deduplication through a rigorous multi-stage pipeline to achieve the best quality for model training, including language identification, URL-based filtering, metric-based cleaning, document refinement, and data deduplication. CulturaX is fully released to the public on HuggingFace to facilitate research and advancements in multilingual LLMs: https://huggingface.co/datasets/uonlp/CulturaX.
1 Introduction
Large language models (LLMs) have fundamentally transformed research and applications of natural language processing (NLP), significantly advancing the state-of-the-art performance for numerous tasks and revealing new emergent abilities (Brown et al., 2020; Wei et al., 2022). Based on the transformer architecture (Vaswani et al., 2017), three major variants of LLMs have been explored in the literature: encoder-only models that encode input texts into representation vectors, e.g., BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019); decoder-only models that generate texts, e.g., GPT (Radford et al., 2019; Brown et al., 2020); and encoder-decoder models that perform sequence-to-sequence generation, e.g., BART (Lewis et al., 2020) and T5 (Raffel et al., 2020). The remarkable capabilities of LLMs have primarily been propelled by the ever-expanding scale of model sizes and training datasets, which the scaling laws deem essential for achieving optimal performance (Hernandez et al., 2022). For instance, beginning with the BERT model, which had a mere few hundred million parameters (Devlin et al., 2019), recent GPT-based models have been expanded to encompass hundreds of billions of parameters (Shoeybi et al., 2019; Scao et al., 2022; Lieber et al., 2021; Chowdhery et al., 2022). Similarly, the training datasets for LLMs have grown exponentially, evolving from a modest 13GB of text data from Wikipedia and books used for BERT (Devlin et al., 2019; Liu et al., 2019) to the terabytes of data consumed by the latest models, such as Falcon (Penedo et al., 2023), MPT (MosaicML, 2023), LLaMa (Touvron et al., 2023), PolyLM (Wei et al., 2023), and ChatGPT[1].
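As a brief illustration of these three variants, the snippet below loads one representative public checkpoint from each family with the HuggingFace transformers library; the specific checkpoints (bert-base-uncased, gpt2, t5-small) are merely common examples of each architecture, not models associated with CulturaX.

```python
# Illustrative only: one representative public checkpoint per LLM family.
from transformers import (
    AutoModel,               # encoder-only: text -> representation vectors
    AutoModelForCausalLM,    # decoder-only: left-to-right text generation
    AutoModelForSeq2SeqLM,   # encoder-decoder: sequence-to-sequence generation
    AutoTokenizer,
)

encoder = AutoModel.from_pretrained("bert-base-uncased")       # BERT-style
decoder = AutoModelForCausalLM.from_pretrained("gpt2")         # GPT-style
seq2seq = AutoModelForSeq2SeqLM.from_pretrained("t5-small")    # T5-style

# Encoder-only usage: turn a sentence into contextual representation vectors.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
outputs = encoder(**tok("CulturaX covers 167 languages.", return_tensors="pt"))
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)
```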
As the field keeps progressing rapidly, pretrained LLMs have typically been released to the public to foster further research and advancements. These models are obtainable either through commercial APIs, as illustrated by ChatGPT and GPT-4, or via open-source initiatives, exemplified by Falcon and LLaMa. Nevertheless, in contrast to the public accessibility of LLMs, the training datasets that underpin the state-of-the-art models have mostly remained closely guarded secrets, even in the case of open-source LLMs such as BLOOM, LLaMa, MPT, and Falcon. For example, Falcon (Penedo et al., 2023) and BLOOM (Scao et al., 2022) only provide a glimpse of their complete training data, whereas MPT’s, LLaMa’s, and PolyLM’s datasets (Touvron et al., 2023; Wei et al., 2023) remain inaccessible to the public. On the one hand, the lack of transparency has impeded in-depth analysis and comprehension of LLMs, hindering crucial research into attributing and addressing fundamental issues stemming from the training data, such as hallucinations, biases, and toxic content (Tamkin et al., 2021; Weidinger et al., 2021; Kenton et al., 2021; Bommasani et al., 2021). On the other hand, concealing the training data restricts the development of LLMs to a select few stakeholders with ample resources, thereby constraining the democratization and benefits of the technology and exacerbating its biases within broader society.
To attain transparency and democratization for LLMs, it is thus crucial to create large-scale and high-quality datasets for training high-performing LLMs while ensuring their public accessibility to foster deeper research and advancements. In the realm of LLMs, high-quality training datasets are often crafted through the application of extensive data cleaning and deduplication processes, aimed at eliminating noisy and redundant content from vast text collections (Allamanis, 2018; Penedo et al., 2023). To this end, there have been recent efforts from the community to develop such open-source datasets for LLMs, such as RedPajama with 1.21T tokens (Computer, 2023), SlimPajama[2] with 627B tokens, and AI2 Dolma[3] with 3T tokens. However, most of the existing open-source datasets for LLMs are tailored for the English language, which hinders the utilization and performance of the resulting LLMs when applied to non-English languages, particularly those with limited linguistic resources (Bang et al., 2023; Lai et al., 2023). This emphasis on English also restricts the capacity of open-source datasets to comprehensively tackle the research challenges and democratization concerns of LLMs across the diverse spectrum of over 7,000 languages spoken worldwide.
Simultaneously, some multilingual datasets have been developed and made available, providing text data for multiple languages. Nevertheless, their quality and scale fall short of meeting the requirements for training high-performing LLMs. Specifically, the multilingual text dataset sourced from Wikipedia, while of high quality, is regarded as relatively small when it comes to training LLMs (Conneau et al., 2020). The OSCAR datasets (Ortiz Suárez et al., 2019; Ortiz Suárez et al., 2020; Abadji et al., 2021, 2022)[4] extract text data from CommonCrawl (CC) for more than 160 languages. However, these datasets lack document-level deduplication (i.e., removing similar documents in the dataset), leading to the inclusion of redundant information and impairing the performance of generative LLMs (Lee et al., 2022). Similarly, the mC4 (Xue et al., 2021), CCAligned (Conneau et al., 2020), WikiMatrix (Schwenk et al., 2021), and ParaCrawl (Bañón et al., 2020) datasets together support over 100 languages but suffer from less accurate language identification, introducing noise into the data (Kreutzer et al., 2022). These datasets are also not deduplicated at the fuzzy, document level, e.g., via MinHash (Broder, 1997). Additionally, the CC100 dataset (Wenzek et al., 2020; Conneau et al., 2020), employed in training the multilingual XLM-RoBERTa model across 100 languages, only considers the snapshots of CC from 2018, constraining its size and the availability of up-to-date information to train high-performing LLMs.
To address the aforementioned issues with open-source datasets, our work introduces a novel multilingual dataset, called CulturaX, for training LLMs in 167 languages. CulturaX merges the latest iteration of mC4 (version 3.1.0) with all available OSCAR corpora up to the current year, encompassing distributions 20.19, 21.09, 22.01, and 23.01. This amalgamation results in a large multilingual dataset, comprising 27 TB of text data with 6.3 trillion tokens and offering the most up-to-date data for LLM development. More than half of our dataset is dedicated to non-English languages to significantly boost the data size and enhance the feasibility of training models in multilingual scenarios. Importantly, CulturaX is extensively cleaned and deduplicated at the document level to produce the highest-quality data for training LLMs in multiple languages. In particular, our data cleaning process includes a comprehensive pipeline designed to eliminate low-quality data. This involves removing noisy text, non-linguistic content, toxic data, incorrectly identified languages, and more. Our data cleaning pipeline employs a variant of the Interquartile Range (IQR) method (Dekking et al., 2007) to select appropriate thresholds for various dataset metrics (e.g., stopword ratios, data perplexity, and language identification scores), which are then used to filter noisy outliers from the dataset. As such, we leverage the percentiles of the distributions computed over large samples of data to effectively guide the threshold selection process for each filtering metric and language. Finally, we perform extensive deduplication of the data for the languages within our dataset based on the near-deduplication method MinHashLSH (Broder, 1997; Leskovec et al., 2020) and on URLs, leading to high-quality data for training multilingual LLMs. Our dataset will be fully available to the public to promote further research and development in multilingual learning. To our knowledge, CulturaX is the largest open-source multilingual dataset to date that is deeply cleaned and deduplicated for LLM and NLP applications.
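To make the filtering and deduplication ideas above more concrete, the sketch below illustrates two of the core steps under stated assumptions: choosing a per-language cutoff for a filtering metric from the percentiles of its distribution (in the spirit of the IQR-style selection described above), and near-deduplicating documents with MinHashLSH via the datasketch library. This is a minimal sketch rather than the exact CulturaX pipeline; the function names, the 10th-percentile cutoff, the word-level shingling, and the MinHash parameters (128 permutations, 0.8 Jaccard threshold) are illustrative assumptions.

```python
# Minimal sketch (not the CulturaX release code): percentile-based threshold
# selection for a filtering metric, plus MinHashLSH near-deduplication.
# Assumptions: each document is a dict with "text" and a precomputed metric
# (here a language-identification score); cutoff percentile, num_perm, and the
# Jaccard threshold are hypothetical choices for illustration.
import numpy as np
from datasketch import MinHash, MinHashLSH


def percentile_threshold(metric_values, lower_pct=10.0):
    """Pick a cutoff from the metric's distribution over a data sample."""
    return float(np.percentile(metric_values, lower_pct))


def filter_by_metric(docs, metric_key, lower_pct=10.0):
    """Drop documents whose metric falls below the percentile-based cutoff."""
    values = [d[metric_key] for d in docs]
    cutoff = percentile_threshold(values, lower_pct)
    return [d for d in docs if d[metric_key] >= cutoff]


def minhash_of(text, num_perm=128):
    """Build a MinHash signature from word-level shingles (kept simple here)."""
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf-8"))
    return m


def near_dedup(docs, threshold=0.8, num_perm=128):
    """Keep only the first document of each near-duplicate cluster."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for i, doc in enumerate(docs):
        m = minhash_of(doc["text"], num_perm)
        if lsh.query(m):          # a similar document has already been kept
            continue
        lsh.insert(f"doc-{i}", m)
        kept.append(doc)
    return kept


if __name__ == "__main__":
    docs = [
        {"text": "the quick brown fox jumps over the lazy dog", "lang_score": 0.98},
        {"text": "The Quick Brown Fox jumps over the lazy dog", "lang_score": 0.97},
        {"text": "asdf qwer zxcv 1234", "lang_score": 0.20},
    ]
    cleaned = filter_by_metric(docs, "lang_score", lower_pct=10.0)
    deduped = near_dedup(cleaned)
    print(len(docs), "->", len(cleaned), "->", len(deduped))  # 3 -> 2 -> 1
```

In the full pipeline described above, such a threshold would be derived separately for each filtering metric and each of the 167 languages, and MinHash signatures would be computed over document shingles at corpus scale, with URL-based deduplication applied in addition.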
This paper is available on arxiv under CC BY 4.0 DEED license.
[1] https://openai.com/blog/chatgpt
[2] https://www.cerebras.net/blog/slimpajama-a-627b-token-cleaned-and-deduplicated-version-of-redpajama
[3] https://blog.allenai.org/dolma-3-trillion-tokens-open-llm-corpus-9a0ff4b8da64
[4] https://oscar-project.org