A dataset used to train large language models allegedly contained 12,000 live API keys and authentication credentials. Some of these were reportedly still active and allowed unauthorized access. Truffle Security found these secrets in a December 2024 Common Crawl archive, which spans 250 billion web pages. The affected credentials could have been exploited for unauthorized data access, service disruptions, financial fraud, and a variety of other malicious uses.