Publishers Target Common Crawl In Fight Over AI Training Data

Posted on: 14 Jun, 07:20 AM

Key Points

Danish media outlets have demanded that the nonprofit web archive Common Crawl remove copies of their articles from past data sets and stop crawling their websites immediately.

Common Crawl plans to comply with the request, first issued on Monday.

In its complaint, the New York Times highlighted how Common Crawl's data was the most highly weighted data set in GPT-3. Thomas Heldrup, head of content protection and enforcement at the Danish Rights Alliance (DRA), says that this new effort was inspired by the Times.

Common Crawl is caught up in this conflict about copyright and generative AI, says Stefan Baack, a data analyst at the Mozilla Foundation who recently published a report on Common Crawl's role in AI training.

Common Crawl's evolution from a low-key tool beloved by data nerds and ignored by everyone else to a newly controversial AI helpmate is part of a larger clash over copyright and the open web. A growing contingent of publishers, as well as some artists, writers, and other creative types, are fighting efforts to crawl and scrape the web, sometimes even if said efforts are noncommercial, like Common Crawl's ongoing project.

Full story at WIRED
