AI Bots vs Humans: An Arms Race Without a Winner

Adrian Krebs, Co-Founder & CEO of Kadoa
10 February 2025

A research report from last year found that almost 50% of internet traffic now comes from non-human sources.

We explore the challenges of an internet where bots outnumber humans and how both can coexist.

AI Crawlers

Web scraping has been around since the early days of the internet. Search crawlers index pages to serve search results, and these traditional bots usually follow established best practices and respect server limitations.

AI crawlers don't play by these rules. AI companies now deploy bots that often ignore these conventions, collecting data at unprecedented scale to train their models.

Numerous reports show that these bots ignore robots.txt and even spoof their user agents.

Microsoft Bot:

For past 2-3 months my company is getting CPU and RAM usage alert from servers due to Microsoft Bots with user agent “-“. We have opened an abuse ticket with them and they closed it with some random excuse. We are seeing ChatGPT bots too along with them.

— Source: reddit.com

OpenAI Bot:

So after that I decided to block by User-agent just to find out they sneakily removed the user agent to be able to scan my website.

— Source: reddit.com

Amazon Bot:

I don't want to have to close off my Gitea server to the public, but I will if I have to. It's futile to block AI crawler bots because they lie, change their user agent, use residential IP addresses as proxies, and more. I just want the requests to stop.

— Source: xeiaso.net

ByteDance Bot:

This month, Fortune.com reported that TikTok’s web scraper — known as Bytespider — is aggressively sucking up content to fuel generative AI models.

— Source: haproxy.com

Some AI companies apparently don't care about best practices or the costs they impose on website operators.

AI Agents

While AI crawlers do large scale training data collection, AI agents automate tasks that previously required human interaction. They navigate websites purposefully, fill forms, make decisions, and extract specific information—much like a human would.

The industry calls 2025 "the year of AI agents" as we see more and more agentic automation frameworks. Examples of such frameworks are OpenAI Operator, Claude Computer Use, and Runner H.

Unlike crawlers, these agents generate focused, legitimate traffic. They follow normal navigation patterns and usually don't hit rate limits. But they also raise new questions: How do we distinguish good automation from bad?

So not all bots are bad, and we will see more and more legitimate use cases where bots serve useful automation purposes without negative side effects.

Bot Protection

Website owners are starting to deploy various defenses against bad bots, but these measures come with downsides like degraded website performance and frustrated legitimate users. The arms race against malicious bots has been running for decades, so the mitigation techniques are well known, and determined bot operators can work around most of them unless you come up with something truly unique.

Robots.txt

The robots.txt file was designed as the internet's gentlemen's agreement between websites and bots. Through a simple text file, websites can provide directives for bots, such as the minimum crawl delay they should use and the parts of the site they should not crawl.

User-Agent: *
Disallow: *.gif$
Disallow: /example/
Allow: /publications/
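
A well-behaved crawler checks these rules before fetching anything. Here is a minimal sketch using Python's standard urllib.robotparser; the domain and bot name are placeholders:

from urllib import robotparser

# Fetch and parse the site's robots.txt (placeholder domain)
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

user_agent = "ExampleBot"
url = "https://example.com/example/page.html"

# A compliant bot checks permission and crawl delay before each request
if rp.can_fetch(user_agent, url):
    delay = rp.crawl_delay(user_agent) or 1  # fall back to a polite default
    print(f"Allowed to fetch {url}, waiting {delay}s between requests")
else:
    print(f"robots.txt disallows {url} for {user_agent}")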

However, the Robots Exclusion Protocol relies on the good-faith spirit of the internet rather than technical enforcement. It is up to each company to decide whether to respect it.
When bots ignore these rules, admins often resort to blocking IP ranges, though this might not work if the bot uses rotating proxies and changing user agents.
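
When a crawler at least identifies itself honestly, blocking it can be as simple as matching the user agent at the edge or in the application. A minimal Flask sketch, with an illustrative (not exhaustive) list of crawler tokens:

from flask import Flask, request, abort

app = Flask(__name__)

# Illustrative list of crawler user-agent substrings to refuse
BLOCKED_AGENTS = ("GPTBot", "Bytespider", "CCBot", "Amazonbot")

@app.before_request
def block_known_crawlers():
    ua = request.headers.get("User-Agent", "")
    if any(bot.lower() in ua.lower() for bot in BLOCKED_AGENTS):
        abort(403)  # refuse the request before any page logic runs

@app.route("/")
def index():
    return "Hello, human."

As the quotes above show, this only works against bots that announce themselves honestly; spoofed user agents and rotating residential proxies defeat it.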

CAPTCHAs

The most common defense against bots is the CAPTCHA system, typically deployed through services like Cloudflare. These services act as reverse proxies that monitor traffic and perform browser fingerprinting to distinguish humans from bots.

CAPTCHAs have been around for over 25 years and their effectiveness has diminished. The traditional distorted text challenges have become obsolete as computers are now better at solving them than humans.

Open-source tools like FlareSolverr can automatically bypass many anti-bot challenges, and OpenAI probably has far more sophisticated closed-source methods to get around them. An entire industry has emerged around defeating these protections.

Every protection comes at a cost. Website owners report 20-30% drops in page views and conversions when implementing CAPTCHAs, which makes them a last resort against spam attacks rather than a default way to block mere annoyances. As one user summarizes:

I'm at the point now that if I get a CAPTCHA, I'm just going to leave the site. I'll spend my money elsewhere or find an alternative

— Source: reddit.com

CAPTCHAs also come with serious accessibility concerns. For example, blind users have to get a special cookie that lets them get past hCaptcha without a challenge. Audio-based alternatives have similar issues.

The human and environmental cost is significant, as a 2023 study from UC Irvine shows:

Time Wasted: 819 million hours
Average Solve Time: 3.53 seconds
Economic Cost: $6.1 billion
Bandwidth Consumed: 134 Petabytes
Energy Usage: 7.5 million kWh
CO2 Emissions: 7.5 million pounds

*Based on an estimated 512 billion CAPTCHAs completed globally between 2010 and 2023.

Despite all the drawbacks, CAPTCHAs still work well for raising the cost of successful spamming above the expected payoff. Solving them at scale remains expensive, and it would be uneconomical for most bad actors.

Poisoning training data

As traditional defenses fail, website owners are becoming more creative by trying to poison the LLM training data the crawlers are collecting.

Two libraries I've recently discovered:

  • Nepenthes: Traps AI crawlers in endless loops of nonsensical content and specifically targets known AI company IP ranges
  • Quixotic: Subtly alters content using Markov chains to make it useless for LLM training (the idea is sketched below)
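
The Quixotic idea can be illustrated in a few lines of Python: build a Markov chain from the page's own text and serve regenerated, plausible-looking nonsense to suspected crawlers. This is a rough sketch of the general technique, not Quixotic's actual implementation:

import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the words that follow it in the original text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def garble(text, length=50):
    """Generate plausible-looking but meaningless text from the chain."""
    chain = build_chain(text)
    word = random.choice(list(chain))
    output = [word]
    for _ in range(length - 1):
        candidates = chain.get(word)
        word = random.choice(candidates) if candidates else random.choice(list(chain))
        output.append(word)
    return " ".join(output)

# Serve garble(article_text) to suspected AI crawlers, the real text to humans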

Some administrators even set up honeypot pages: they put a link somewhere on their site that no human would visit, disallow it in robots.txt, and then ban any IP address that requests it.
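
A minimal version of such a honeypot might look like the script below. It assumes a combined-format nginx access log and a hypothetical /trap/ path that is disallowed in robots.txt and linked only from markup humans never see; all paths are illustrative:

import re

LOG_FILE = "/var/log/nginx/access.log"   # assumed log location
BLOCKLIST = "/etc/nginx/blocklist.conf"  # file included by the web server config
TRAP_PATH = "/trap/"                     # disallowed in robots.txt, hidden from humans

ip_pattern = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3})")

banned = set()
with open(LOG_FILE) as log:
    for line in log:
        if TRAP_PATH in line:
            match = ip_pattern.match(line)
            if match:
                banned.add(match.group(1))

# Append offenders to a blocklist the web server (or firewall) consumes
with open(BLOCKLIST, "a") as out:
    for ip in sorted(banned):
        out.write(f"deny {ip};\n")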

The arms race continues, with both protection and circumvention becoming more costly and sophisticated.

The Path Forward

Standardized agent interfaces

Rather than continuing the arms race, a more sustainable solution may be developing standardized interfaces for agent-software interaction. This would allow websites to maintain control while providing structured access to their content.

Anthropic's Model Context Protocol (MCP) is an open protocol that standardizes how applications provide context to AI agents. Such specifications could evolve into industry standards, similar to how RSS once (tried to) standardize content syndication.
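
As a rough illustration of what structured access could look like, here is a minimal MCP server sketch using the Python SDK's FastMCP helper; the server name, tool, and data are made up, and SDK details may differ between versions:

from mcp.server.fastmcp import FastMCP

# A hypothetical publisher exposing articles to agents directly,
# instead of forcing them to scrape rendered HTML.
mcp = FastMCP("example-news")

ARTICLES = {
    "ai-bots-vs-humans": "Full article text would live here...",
}

@mcp.tool()
def get_article(slug: str) -> str:
    """Return the plain-text body of an article by its slug."""
    return ARTICLES.get(slug, "Article not found.")

if __name__ == "__main__":
    mcp.run()

An interface like this gives the site a single, rate-limitable entry point for agents while keeping the HTML experience for humans.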

Content monetization

The rise of AI search engines has disrupted traditional web economics. When LLMs provide direct answers without prominently linking to sources, content creators lose both traffic and advertising revenue. If there were a way for a website to sell its data to an LLM provider rather than being constantly DDoS'd by crawlers, that could work.

This likely won't work for smaller content providers, as the large AI players will only pay for data when it's owned by someone with scary enough lawyers. That's why we have seen a wave of negotiations and conflicts between media companies and AI companies:

  • Google and OpenAI closed deals with Reddit
  • Axel Springer's licensing agreement with OpenAI
  • AP news-sharing and tech deal with OpenAI
  • Agence France-Presse (AFP) and Mistral AI have entered into a multi-year partnership

Several startups are now developing platforms to help with content licensing for AI training. Established players like Cloudflare are probably best positioned to license content at scale, as they already sit in front of a large share of web traffic.

Human verification

The limitations of CAPTCHAs have led to the exploration of more reliable methods to distinguish humans from bots.
World(coin) is working on verifying users online as biological humans using biometric data through its World ID system. Ironically, Sam Altman is a major investor while also leading one of the primary companies whose AI models drive the increase in bot traffic.

Conclusion

Are we heading towards a "dead internet" where most of the internet is fake and mainly consists of bots? No. But we are moving toward a web where automated agents and humans will increasingly coexist as equal participants. The challenge isn't stopping bots - it's building systems that enable beneficial automation while protecting against abusive AI crawlers that threaten both the infrastructure and the business models of websites.

This will require rethinking how good bots and humans coexist on the web, without further increasing the cost and frustration of protecting ourselves from bad bots.