It’s too early to say how the wave of deals between AI companies and publishers will shake out. However, OpenAI has already scored one clear victory: its web crawlers are not being blocked by leading news outlets at the rate they once were.
The generative AI boom sparked a gold rush for data, and then a data-protection rush (for most news websites, anyway) in which publishers tried to block AI bots and prevent their work from becoming training data without consent. When Apple debuted a new AI agent this summer, for example, multiple leading news outlets quickly opted out of Apple’s web scraping using the Robots Exclusion Protocol, or robots.txt, the file that allows webmasters to control bots. There are so many new AI bots on the scene that keeping up can feel like playing whack-a-mole.
OpenAI’s GPTBot has the most name recognition and is also blocked more often than competitors like Google AI. The number of high-profile media websites using robots.txt to “disallow” OpenAI’s GPTBot rose dramatically from its launch in August 2023 through that fall, then climbed steadily (but more gradually) from November 2023 to April 2024, according to an analysis of 1,000 popular news outlets by Ontario-based AI detection startup Originality AI. At its peak, just over a third of the websites blocked it; that share has now dropped to closer to a quarter. Within a smaller group of the most prominent news outlets, the blocking rate is still above 50 percent, but it is down from highs of nearly 90 percent earlier this year.
These declines make obvious sense. When companies enter into partnerships and give permission for their data to be used, they no longer have an incentive to barricade it, so it follows that they would update their robots.txt files to allow crawling; make enough deals, and the overall percentage of sites blocking bots will almost certainly decrease. Some outlets, such as The Atlantic, unblocked OpenAI’s bots the same day they announced a deal. Others took a few days to a few weeks, like Vox, which announced its partnership in late May but unblocked GPTBot on its properties by the end of June.
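
For readers curious what blocking and unblocking look like in practice, here is a minimal sketch using Python's standard `urllib.robotparser` module. The two robots.txt snippets, the `is_blocked` helper, and the `example.com` URL are illustrative assumptions, not taken from any publisher's actual file; only the `GPTBot` user-agent name comes from the reporting above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, for illustration only.
BLOCKING = """\
User-agent: GPTBot
Disallow: /
"""

ALLOWING = """\
User-agent: GPTBot
Allow: /
"""


def is_blocked(robots_txt: str, user_agent: str = "GPTBot",
               url: str = "https://example.com/") -> bool:
    """Return True if the given robots.txt text disallows user_agent from url."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print("Blocking file blocks GPTBot:", is_blocked(BLOCKING))
    print("Allowing file blocks GPTBot:", is_blocked(ALLOWING))
```

A survey like the one described above would presumably fetch each outlet's live robots.txt and run a similar check per crawler. Keep in mind that robots.txt is a request that crawlers choose to honor, not a technical barrier.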