===Artificial intelligence===
Starting in the 2020s, web operators began using robots.txt to deny access to bots collecting training data for [[generative artificial intelligence|generative AI]]. In 2023, Originality.AI found that 306 of the thousand most-visited websites blocked [[OpenAI]]'s GPTBot in their robots.txt file and 85 blocked [[Google]]'s Google-Extended. Many robots.txt files named GPTBot as the only bot explicitly disallowed on all pages. Denying access to GPTBot was common among news websites such as the [[BBC]] and ''[[The New York Times]]''. In 2023, blog host [[Medium (website)|Medium]] announced it would deny access to all artificial intelligence web crawlers as "AI companies have leached value from writers in order to spam Internet readers".<ref name="Verge"/>
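As an illustration of the kind of rule described above, the following sketch uses Python's standard <code>urllib.robotparser</code> module to check a robots.txt file that disallows GPTBot and Google-Extended on all pages. The file contents, the <code>is_allowed</code> helper, and the <code>example.com</code> URL are assumptions made for this example and are not taken from any particular website.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that names two AI crawlers and blocks them everywhere.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/article"  # illustrative URL
    print(is_allowed(ROBOTS_TXT, "GPTBot", page))           # False
    print(is_allowed(ROBOTS_TXT, "Google-Extended", page))  # False
    print(is_allowed(ROBOTS_TXT, "Googlebot", page))        # True
```

Because no wildcard (<code>*</code>) group is present, crawlers that are not named in the file remain allowed, which matches the pattern of naming GPTBot as the only disallowed bot.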

GPTBot complies with the robots.txt standard, and OpenAI gives web operators advice on how to disallow it, but ''[[The Verge]]''{{'}}s David Pierce said this only began after "training the underlying models that made it so powerful". Also, some bots are used both for search engines and for artificial intelligence, and it may be impossible to block only one of these uses.<ref name="Verge"/> ''[[404 Media]]'' reported that companies like [[Anthropic]] and [[Perplexity.ai]] circumvented robots.txt by renaming or spinning up new scrapers to replace the ones that appeared on popular [[blocklist]]s.<ref>{{Cite web |last=Koebler |first=Jason |date=2024-07-29 |title=Websites are Blocking the Wrong AI Scrapers (Because AI Companies Keep Making New Ones) |url=https://www.404media.co/websites-are-blocking-the-wrong-ai-scrapers-because-ai-companies-keep-making-new-ones/ |access-date=2024-07-29 |website=404 Media}}</ref>
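The limitation of name-based blocking can be sketched with the same standard-library parser. In this example the user-agent names <code>OldScraper</code> and <code>NewScraper</code> are hypothetical placeholders, as are the <code>allowed</code> helper and the URL; the sketch only shows how robots.txt matching treats a renamed crawler and does not describe the behavior of any real bot.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str,
            url: str = "https://example.com/post") -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A rule that blocks one crawler by name (hypothetical name).
NAMED_BLOCK = """\
User-agent: OldScraper
Disallow: /
"""

# A wildcard rule that blocks every crawler that honors robots.txt.
BLANKET_BLOCK = """\
User-agent: *
Disallow: /
"""

if __name__ == "__main__":
    print(allowed(NAMED_BLOCK, "OldScraper"))    # False: the named bot is blocked
    print(allowed(NAMED_BLOCK, "NewScraper"))    # True: a renamed bot is not matched
    print(allowed(BLANKET_BLOCK, "NewScraper"))  # False: the wildcard covers it
    print(allowed(BLANKET_BLOCK, "Googlebot"))   # False: search crawlers are blocked too
```

A rule keyed to one name has no effect on a crawler that announces a different name, while a wildcard rule also excludes search engine crawlers. In either case the file has an effect only on crawlers that choose to honor it.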