German research organization LAION has released a new dataset called Re-LAION-5B, which it claims has been thoroughly cleaned of links to suspected child sexual abuse material (CSAM). LAION implemented fixes and recommendations from several organizations, including the Internet Watch Foundation and Human Rights Watch, to ensure the dataset's integrity. The dataset is available in two versions: Re-LAION-5B Research and Re-LAION-5B Research-Safe, the latter of which additionally filters out other NSFW content. LAION's datasets do not contain images themselves but rather indexes of image links paired with their alt text, drawn from the Common Crawl corpus.
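To make the links-plus-alt-text structure concrete, here is a minimal sketch of loading one metadata shard and resolving a link to an actual image. The file name is hypothetical, and the column names (URL, TEXT) are an assumption modeled on earlier LAION releases:

```python
# Sketch: load one metadata shard and fetch one linked image.
# File name and column names (URL, TEXT) are assumptions, not confirmed
# details of the Re-LAION-5B release.
import io

import pandas as pd      # pip install pandas pyarrow
import requests
from PIL import Image    # pip install pillow

shard = pd.read_parquet("re-laion-5b-part-00000.parquet")  # hypothetical shard
print(shard[["URL", "TEXT"]].head())

# The dataset holds only links and captions; images are fetched separately.
row = shard.iloc[0]
resp = requests.get(row["URL"], timeout=10)
image = Image.open(io.BytesIO(resp.content))
print(row["TEXT"], image.size)
```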
The release of Re-LAION-5B follows a December 2023 investigation by the Stanford Internet Observatory, which found that LAION-5B included links to illegal images scraped from social media posts and adult websites. In response, LAION temporarily took LAION-5B offline. The Stanford report also recommended deprecating and ceasing distribution of models trained on LAION-5B, and AI startup Runway recently took down its Stable Diffusion 1.5 model from the model-hosting platform Hugging Face.
The new Re-LAION-5B dataset contains approximately 5.5 billion text-image pairs and is released under an Apache 2.0 license. LAION emphasizes that third parties can use the released metadata to clean existing copies of LAION-5B by removing the matching illegal content. LAION specifies that its datasets are intended for research purposes only, although some organizations, such as Stability AI and Google, have used LAION datasets to train their models.
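One way to perform the cleaning step LAION describes is to retain only the rows of an existing LAION-5B copy whose links also appear in the Re-LAION-5B metadata. The sketch below assumes both copies are Parquet files joined on a URL column; the file names and join key are illustrative, not LAION's documented procedure:

```python
# Hedged sketch: filter a local LAION-5B shard against Re-LAION-5B metadata,
# keeping only links that survived the cleanup. File names and the "URL"
# join key are assumptions for illustration.
import pandas as pd

old_copy = pd.read_parquet("laion-5b-local-shard.parquet")
clean_meta = pd.read_parquet("re-laion-5b-shard.parquet")

kept_urls = set(clean_meta["URL"])
cleaned = old_copy[old_copy["URL"].isin(kept_urls)]

removed = len(old_copy) - len(cleaned)
print(f"Dropped {removed} rows not present in Re-LAION-5B")
cleaned.to_parquet("laion-5b-local-shard-cleaned.parquet")
```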
LAION removed 2,236 links to suspected CSAM after matching them against lists of link and image hashes provided by its partner organizations. This total includes the 1,008 links identified in the Stanford Internet Observatory report. LAION strongly encourages research labs and organizations still using the old LAION-5B to migrate promptly to the Re-LAION-5B datasets.
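The hash-matching approach mentioned above can be sketched as follows: flag any entry whose hashed link appears in a partner-supplied blocklist. The blocklist file names are hypothetical, and the choice of SHA-256 over URLs is an illustrative assumption rather than LAION's documented scheme (partners may equally match on hashes of the image bytes themselves):

```python
# Sketch of blocklist matching: flag dataset entries whose link hash appears
# in a partner-provided list. Hash choice (SHA-256 of the URL) and file names
# are assumptions; real lists may instead hash downloaded image bytes.
import hashlib

import pandas as pd

link_blocklist = set(open("bad_link_hashes.txt").read().split())  # hypothetical file

def url_hash(url: str) -> str:
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

shard = pd.read_parquet("laion-5b-local-shard.parquet")
flagged = shard[shard["URL"].map(url_hash).isin(link_blocklist)]
print(f"{len(flagged)} links matched the blocklist and would be removed")
```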