AI Bots vs Humans: An Arms Race Without a Winner

Adrian Krebs, Co-Founder & CEO of Kadoa
10 February 2025

A research report from last year found that almost 50% of internet traffic now comes from non-human sources.

We explore the challenges of an internet where bots outnumber humans and how both can coexist.

AI Crawlers

Web scraping has been around since the early days of the internet. Search crawlers index pages to serve search results, and these traditional bots usually follow established best practices and respect server limitations.

AI crawlers don't play by these rules. AI companies now deploy bots that often ignore these conventions, collecting data at unprecedented scale to train their models.

Numerous reports show that these bots ignore robots.txt and even spoof their user agents.

Microsoft Bot:

For past 2-3 months my company is getting CPU and RAM usage alert from servers due to Microsoft Bots with user agent “-“. We have opened an abuse ticket with them and they closed it with some random excuse. We are seeing ChatGPT bots too along with them.

— Source: reddit.com

OpenAI Bot:

So after that I decided to block by User-agent just to find out they sneakily removed the user agent to be able to scan my website.

— Source: reddit.com

Amazon Bot:

I don't want to have to close off my Gitea server to the public, but I will if I have to. It's futile to block AI crawler bots because they lie, change their user agent, use residential IP addresses as proxies, and more. I just want the requests to stop.

— Source: xeiaso.net

ByteDance Bot:

This month, Fortune.com reported that TikTok’s web scraper — known as Bytespider — is aggressively sucking up content to fuel generative AI models.

— Source: haproxy.com

Some AI companies apparently don't care about best practices or the costs they impose on website operators.

AI Agents

While AI crawlers do large scale training data collection, AI agents automate tasks that previously required human interaction. They navigate websites purposefully, fill forms, make decisions, and extract specific information—much like a human would.

The industry calls 2025 "the year of AI agents" as we see more and more agentic automation frameworks. Examples of such frameworks are OpenAI Operator, Claude Computer Use, and Runner H.

Unlike crawlers, these agents generate focused, legitimate traffic. They follow normal navigation patterns and usually don't hit rate limits. But they also raise new questions: How do we distinguish good automation from bad?

So not all bots are bad, and we will see more and more legitimate use cases where bots serve useful automation purposes without negative side effects.

Bot Protection

Website owners are starting to deploy various defenses against bad bots, but these measures come with downsides like degraded website performance and frustrated legitimate users. The arms race against malicious bots has been running for decades, so the mitigation techniques are well known, and determined bot operators can work around most of them unless you come up with something truly unique.

Robots.txt

The robots.txt file was designed as the internet's gentlemen's agreement between websites and bots. Through a simple text file, websites can provide directives for bots, such as the minimum crawl delay they should use and the parts of the site they should not crawl.

User-Agent: *
Disallow: *.gif$
Disallow: /example/
Allow: /publications/
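
A well-behaved crawler checks these rules before fetching anything. Here is a minimal sketch using Python's standard urllib.robotparser; the domain and bot name are placeholders:

from urllib import robotparser

# Fetch and parse the site's robots.txt (placeholder domain)
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

user_agent = "ExampleBot"
url = "https://example.com/example/page.html"

# A compliant bot checks permission and crawl delay before each request
if rp.can_fetch(user_agent, url):
    delay = rp.crawl_delay(user_agent) or 1  # fall back to a polite default
    print(f"Allowed to fetch {url}, waiting {delay}s between requests")
else:
    print(f"robots.txt disallows {url} for {user_agent}")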

However, the Robots Exclusion Protocol relies on the good-faith spirit of the internet rather than technical enforcement. It is up to each company to decide whether to respect it.
When bots ignore these rules, admins often resort to blocking IP ranges, though this might not work if the bot uses rotating proxies and changing user agents.
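
When a crawler at least identifies itself honestly, blocking it can be as simple as matching the user agent at the edge or in the application. A minimal Flask sketch, with an illustrative (not exhaustive) list of crawler tokens:

from flask import Flask, request, abort

app = Flask(__name__)

# Illustrative list of crawler user-agent substrings to refuse
BLOCKED_AGENTS = ("GPTBot", "Bytespider", "CCBot", "Amazonbot")

@app.before_request
def block_known_crawlers():
    ua = request.headers.get("User-Agent", "")
    if any(bot.lower() in ua.lower() for bot in BLOCKED_AGENTS):
        abort(403)  # refuse the request before any page logic runs

@app.route("/")
def index():
    return "Hello, human."

As the quotes above show, this only works against bots that announce themselves honestly; spoofed user agents and rotating residential proxies defeat it.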

CAPTCHAs

The most common defense against bots is the CAPTCHA system, typically deployed through services like Cloudflare. These services act as reverse proxies that monitor traffic and perform browser fingerprinting to distinguish humans from bots.

CAPTCHAs have been around for over 25 years and their effectiveness has diminished. The traditional distorted text challenges have become obsolete as computers are now better at solving them than humans.

Open-source tools like FlareSolverr can automatically bypass many anti-bot challenges, and OpenAI probably has far more sophisticated closed-source methods to get around them. An entire industry has emerged around defeating these protections.

Every protection comes at a cost. Website owners report 20-30% drops in page views and conversions when implementing CAPTCHAs, which makes them a last resort against spam attacks rather than a default way to block mere annoyances. As one user summarizes:

I'm at the point now that if I get a CAPTCHA, I'm just going to leave the site. I'll spend my money elsewhere or find an alternative

— Source: reddit.com

CAPTCHAs also come with serious accessibility concerns. For example, blind users have to get a special cookie that lets them get past hCaptcha without a challenge. Audio-based alternatives have similar issues.

The human and environmental cost is significant, as a 2023 study from UC Irvine shows:

Time Wasted: 819 million hours
Average Solve Time: 3.53 seconds
Economic Cost: $6.1 billion
Bandwidth Consumed: 134 Petabytes
Energy Usage: 7.5 million kWh
CO2 Emissions: 7.5 million pounds

*Based on an estimated 512 billion CAPTCHAs completed globally between 2010 and 2023.

Despite all the drawbacks, CAPTCHAs still work well for raising the cost of successful spamming above the expected payoff. Solving them at scale remains expensive, and it would be uneconomical for most bad actors.

Poisoning training data

As traditional defenses fail, website owners are becoming more creative by trying to poison the LLM training data the crawlers are collecting.

Two libraries I've recently discovered:

  • Nepenthes: Traps AI crawlers in endless loops of nonsensical content and specifically targets known AI company IP ranges
  • Quixotic: Subtly alters content using Markov chains to make it useless for LLM training (the idea is sketched below)
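
The Quixotic idea can be illustrated in a few lines of Python: build a Markov chain from the page's own text and serve regenerated, plausible-looking nonsense to suspected crawlers. This is a rough sketch of the general technique, not Quixotic's actual implementation:

import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the words that follow it in the original text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def garble(text, length=50):
    """Generate plausible-looking but meaningless text from the chain."""
    chain = build_chain(text)
    word = random.choice(list(chain))
    output = [word]
    for _ in range(length - 1):
        candidates = chain.get(word)
        word = random.choice(candidates) if candidates else random.choice(list(chain))
        output.append(word)
    return " ".join(output)

# Serve garble(article_text) to suspected AI crawlers, the real text to humans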

Some administrators even set up honeypot pages: they put a link somewhere on their site that no human would visit, disallow it in robots.txt, and then ban any IP address that requests it.
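
A minimal version of such a honeypot might look like the script below. It assumes a combined-format nginx access log and a hypothetical /trap/ path that is disallowed in robots.txt and linked only from markup humans never see; all paths are illustrative:

import re

LOG_FILE = "/var/log/nginx/access.log"   # assumed log location
BLOCKLIST = "/etc/nginx/blocklist.conf"  # file included by the web server config
TRAP_PATH = "/trap/"                     # disallowed in robots.txt, hidden from humans

ip_pattern = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3})")

banned = set()
with open(LOG_FILE) as log:
    for line in log:
        if TRAP_PATH in line:
            match = ip_pattern.match(line)
            if match:
                banned.add(match.group(1))

# Append offenders to a blocklist the web server (or firewall) consumes
with open(BLOCKLIST, "a") as out:
    for ip in sorted(banned):
        out.write(f"deny {ip};\n")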

The arms race continues, with both protection and circumvention becoming more costly and sophisticated.

The Path Forward

Standardized agent interfaces

Rather than continuing the arms race, a more sustainable solution may be developing standardized interfaces for agent-software interaction. This would allow websites to maintain control while providing structured access to their content.

Anthropic's Model Context Protocol (MCP) is an open protocol that standardizes how applications provide context to AI agents. Such specifications could evolve into industry standards, similar to how RSS once (tried to) standardize content syndication.
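
As a rough illustration of what structured access could look like, here is a minimal MCP server sketch using the Python SDK's FastMCP helper; the server name, tool, and data are made up, and SDK details may differ between versions:

from mcp.server.fastmcp import FastMCP

# A hypothetical publisher exposing articles to agents directly,
# instead of forcing them to scrape rendered HTML.
mcp = FastMCP("example-news")

ARTICLES = {
    "ai-bots-vs-humans": "Full article text would live here...",
}

@mcp.tool()
def get_article(slug: str) -> str:
    """Return the plain-text body of an article by its slug."""
    return ARTICLES.get(slug, "Article not found.")

if __name__ == "__main__":
    mcp.run()

An interface like this gives the site a single, rate-limitable entry point for agents while keeping the HTML experience for humans.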

Content monetization

The rise of AI search engines has disrupted traditional web economics. When LLMs provide direct answers without prominently linking to sources, content creators lose both traffic and advertising revenue. If there were a way for a website to sell its data to an LLM provider rather than being constantly DDoS'd by crawlers, that could work.

This likely won't work for smaller content providers, as the large AI players will only pay for data when it's owned by someone with scary enough lawyers. That's why we have seen a wave of negotiations and conflicts between media companies and AI companies:

  • Google and OpenAI closed deals with Reddit
  • Axel Springer's licensing agreement with OpenAI
  • AP news-sharing and tech deal with OpenAI
  • Agence France-Presse (AFP) and Mistral AI have entered into a multi-year partnership

Several startups are now developing platforms to help with content licensing for AI training. Established players like Cloudflare are probably best positioned to license content at scale, as they already sit in front of a large share of web traffic.

Human verification

The limitations of CAPTCHAs have led to the exploration of more reliable methods to distinguish humans from bots.
World(coin) is working on verifying users online as biological humans using biometric data through its World ID system. Ironically, Sam Altman is a major investor while also leading one of the primary companies whose AI models drive the increase in bot traffic.

Conclusion

Are we heading towards a "dead internet" where most of the internet is fake and mainly consists of bots? No. But we are moving toward a web where automated agents and humans will increasingly coexist as equal participants. The challenge isn't stopping bots - it's building systems that enable beneficial automation while protecting against abusive AI crawlers that threaten both the infrastructure and the business models of websites.

This will require rethinking how good bots and humans coexist on the web, without further increasing the cost and frustration of protecting ourselves from bad bots.