
Companies warn of the arrival of AI web spiders

Businesses are increasingly resorting to blocking artificial intelligence (AI) crawlers that scrape the web bit by bit and hamper website performance, industry executives and experts say.

AI crawlers are computer programs that collect data from websites to train large language models. With the increased use of AI search and the need to collect training data, the Internet is seeing many new web scrapers such as Bytespider, PerplexityBot, ClaudeBot and GPTBot.

Until 2022, the internet was crawled mainly by conventional search engine bots such as GoogleBot, AppleBot and BingBot, which for decades obeyed the principles of ethical content retrieval and scheduling.

ETtech

Aggressive AI bots, on the other hand, not only violate content guidelines but also degrade website performance, adding overhead and posing security threats. Many websites and content portals are deploying anti-scraping measures or bot-restriction technologies in response. According to Cloudflare, a leading content delivery network provider, almost 40% of the top 10 internet domains, which are accessed by 80% of AI bots, are moving to block AI crawlers.
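Bot-restriction measures of this kind often begin with a simple user-agent blocklist. A minimal sketch follows; the bot tokens are the crawler names mentioned in this article, while the functions themselves are hypothetical illustrations, not any particular vendor's implementation:

```python
# Sketch of user-agent-based AI-crawler blocking.
# The tokens below are real crawler user-agent names cited in the article;
# the blocklist approach itself is illustrative only.

AI_BOT_TOKENS = ("Bytespider", "PerplexityBot", "ClaudeBot", "GPTBot", "CCBot")

def is_ai_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent header contains a known AI bot token."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in AI_BOT_TOKENS)

def handle_request(user_agent: str) -> int:
    """A server or middleware would reject matching requests, e.g. with HTTP 403."""
    return 403 if is_ai_crawler(user_agent) else 200
```

In practice, user-agent strings are trivially spoofed, which is why commercial bot-management products also rely on IP reputation and behavioural signals.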


Indian technology body Nasscom said these crawlers are particularly harmful to news publishers when they use created content without attribution. “Whether using copyrighted data for training AI models qualifies as fair use is a moot question,” said Raj Shekhar, head of AI at Nasscom. “The legal dispute between ANI Media and OpenAI is a wake-up call for AI developers to follow IP (intellectual property) laws when collecting training data. Developers should therefore exercise caution and consult with intellectual property experts to ensure compliant data practices and avoid potential liabilities.”

Reuben Koh, director of technology and security strategy at content delivery network company Akamai Technologies, said: “Scraping causes significant overhead and impacts the performance of a website. To extract every piece of content, a scraper interacts intensively with the site, and this incurs a performance penalty.”

According to Cloudflare’s analysis of the top 10,000 internet domains, three AI bots accounted for the largest share of websites visited: Bytespider, operated by ByteDance, TikTok’s Chinese parent (40.40%); GPTBot, operated by OpenAI (35.46%); and ClaudeBot, run by Anthropic (11.17%). Although these AI bots follow the rules, Cloudflare customers overwhelmingly choose to block them, the company said. There is also CCBot, developed by Common Crawl, which scrapes the web to build an open-source dataset anyone can use.

What sets AI crawlers apart

AI crawlers differ from conventional crawlers in that they target high-quality text, images and videos that can enhance training datasets. AI-based crawlers are smarter than traditional search engine bots, “which just crawl, collect data and call it a day,” said Akamai’s Koh. “Their intelligence is not only used for data selection but also for data classification and prioritisation. This means that even after they have crawled, indexed and retrieved all the data, they can process what the data is going to be used for,” he said.

Traditionally, web scraper bots follow the robots.txt protocol as the guiding principle on what may be indexed. Traditional search engine bots such as GoogleBot and BingBot adhere to it and stay away from protected intellectual property. AI bots, however, have been found to violate robots.txt on several occasions. “Google and Bing don’t overwhelm websites because they follow a predictable and transparent indexing schedule. For example, Google clearly states how often it indexes a particular domain, allowing businesses to anticipate and manage the potential impact on performance,” said Koh. “With newer, more aggressive crawlers, such as those driven by AI, the situation is less predictable. These bots do not necessarily operate on a fixed schedule and their scraping activities can be much more intensive.”
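For illustration, a robots.txt that opts out of the AI crawlers named in this article while leaving conventional search bots alone might look like the sketch below (the directives follow standard robots.txt syntax; as the article notes, whether a given bot actually honours them is not guaranteed):

```text
# Disallow known AI training crawlers site-wide
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: CCBot
Disallow: /

# Conventional search engine bots remain allowed
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /
```

Because compliance is voluntary, such a file is a statement of policy rather than an enforcement mechanism; enforcement requires server-side blocking of the kind described earlier.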

Koh warned of a third category of crawlers that are malicious in nature and misuse data for fraudulent purposes. According to Akamai’s State of the Internet study, more than 40% of all internet traffic comes from bots, with approximately 65% of that coming from malicious bots.

Can’t block them all

However, according to experts, blocking AI crawlers outright cannot be the ultimate solution, because websites need to be discovered. If AI search becomes the new way people search, websites must still appear in its results to be found and gain customers, they said. “Companies will be concerned: are we blocking legitimate, revenue-generating crawler activity? Or are we allowing too much malicious activity to occur on our website? It’s a very delicate balance they need to understand,” Koh said.