The Race to Block OpenAI’s Scraping Bots Is Slowing Down

Key Points

The generative AI boom sparked a gold rush for data, and a subsequent data-protection rush (for most news websites, anyway) in which publishers sought to block AI crawlers and prevent their work from becoming training data without consent.

The number of high-ranking media websites using robots.txt to disallow OpenAI's GPTBot increased dramatically from the crawler's August 2023 launch until that fall, then rose steadily (but more gradually) from November 2023 to April 2024, according to an analysis of 1,000 popular news outlets by Ontario-based AI detection startup Originality AI.
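For readers unfamiliar with the mechanism, the following is a minimal sketch of how a robots.txt rule can disallow GPTBot and how a survey might tally the share of sites doing so. It uses Python's standard urllib.robotparser; the sample file contents, the helper names is_blocked and blocking_share, and the example.com URL are illustrative assumptions, not the method Originality AI actually used.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks GPTBot but allows every other crawler.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_blocked(robots_txt: str, user_agent: str,
               url: str = "https://example.com/") -> bool:
    """Return True if robots_txt disallows user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


def blocking_share(robots_files: list[str], user_agent: str = "GPTBot") -> float:
    """Fraction of the given robots.txt files that block user_agent at the site root."""
    if not robots_files:
        return 0.0
    blocked = sum(is_blocked(text, user_agent) for text in robots_files)
    return blocked / len(robots_files)


if __name__ == "__main__":
    print(is_blocked(SAMPLE_ROBOTS_TXT, "GPTBot"))
    print(is_blocked(SAMPLE_ROBOTS_TXT, "Googlebot"))
    # One blocking file and one empty (fully permissive) file.
    print(blocking_share([SAMPLE_ROBOTS_TXT, ""]))
```

Note that robots.txt is advisory: it records a publisher's stated preference, and it only has an effect if the crawler chooses to honor it.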

When companies enter into partnerships and give permission for their data to be used, they're no longer incentivized to barricade it, so it would follow that they would update their robots.txt files to permit crawling; make enough deals and the overall percentage of sites blocking crawlers will almost certainly go down.

(Time did not respond to WIRED's request for comment on why it still had GPTBot blocked.) However, once a deal is made, whether a partner still blocks GPTBot is unimportant, according to OpenAI spokesperson Kayla Wood, because OpenAI no longer accesses that partner's data the way it crawls what it calls publicly available data.

"We leverage direct feeds," she says. Meanwhile, a few notable media outlets have unblocked OpenAI's web crawler despite not making any sort of partnership announcement, as data journalist Ben Welsh pointed out to WIRED.
