How to use robots.txt to block AI crawler bots

Whether you're a content creator or a blogger, you generate unique, high-quality content for a living. Have you noticed that generative AI platforms such as OpenAI and Common Crawl (CCBot) use your content to train their models without your consent? Don't worry! You can use a robots.txt file to prevent these AI crawlers from accessing your website or blog.


What is a robots.txt file?

robots.txt is nothing more than a text file that instructs robots (such as search engine bots) how to crawl and index the pages on your website. You can allow or block good or bad bots that honor your robots.txt file. The syntax for blocking a single bot using its user agent is as follows:

user-agent: {BOT-NAME-HERE}
disallow: /

 

Here's how to allow a specific bot to crawl your site using its user agent:

User-agent: {BOT-NAME-HERE}
Allow: /

 

Where to place the robots.txt file?

Upload the file to your website's root folder. So the URL will look like this:

https://example.com/robots.txt
https://blog.example.com/robots.txt
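To confirm the file is live, you can fetch it with curl. This is just a quick sanity check, using the example.com URL above as a stand-in for your own domain:

# Fetch the robots.txt file (replace example.com with your own domain)
curl -s https://example.com/robots.txt

# Print only the HTTP status code; a 200 means the file is reachable
curl -s -o /dev/null -w "%{http_code}\n" https://example.com/robots.txt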

For more information, see the following resources on robots.txt:

  1. Introduction to robots.txt, from Google.
  2. What is robots.txt? | How a robots.txt file works, from Cloudflare.

How to block AI crawler bots using a robots.txt file

The syntax is the same:

user-agent: {AI-CRAWLER-BOT-NAME-HERE}
disallow: /

 

Block OpenAI using robots.txt file

Add the following four lines to robots.txt:

User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /

 

Note that OpenAI has two separate user agents, one for web crawling and one for browsing, each with its own CIDR and IP ranges. Configuring the firewall rules listed below requires an in-depth understanding of networking concepts and root-level access to Linux. If you lack these skills, consider hiring a Linux system administrator to keep up with the ever-changing IP address ranges. This can turn into a game of cat and mouse.

1: ChatGPT-User, used by plugins in ChatGPT

The following is the user agent used by OpenAI's crawlers and fetchers, along with the CIDR or IP address ranges it uses, which you can block with your web server firewall to stop the plugin AI bot. You can block 23.98.142.176/28 using the ufw command or the iptables command on your web server. For example, here is a firewall rule that uses UFW to block that CIDR range:

sudo ufw deny proto tcp from 23.98.142.176/28 to any port 80
sudo ufw deny proto tcp from 23.98.142.176/28 to any port 443
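The text above mentions iptables as an alternative to UFW. A minimal sketch of equivalent rules, assuming a default INPUT chain setup, might look like this:

# Drop inbound HTTP and HTTPS traffic from the ChatGPT-User range
sudo iptables -A INPUT -p tcp -s 23.98.142.176/28 --dport 80 -j DROP
sudo iptables -A INPUT -p tcp -s 23.98.142.176/28 --dport 443 -j DROP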

2: GPTBot, used by ChatGPT

OpenAI also publishes a list of CIDR or IP address ranges for GPTBot that you can block with your web server firewall. Likewise, you can block these ranges using the ufw command or the iptables command. Here is a shell script that blocks those CIDR ranges:

#!/bin/bash
# Purpose: Block OpenAI ChatGPT bot CIDR
# Tested on: Debian and Ubuntu Linux
# Author: Vivek Gite {https://www.cyberciti.biz} under GPL v2.x+
# ----------------------------------------------------------------
file="/tmp/out.txt.$$"
wget -q -O "$file" https://openai.com/gptbot-ranges.txt 2>/dev/null
while IFS= read -r cidr
do
    sudo ufw deny proto tcp from "$cidr" to any port 80
    sudo ufw deny proto tcp from "$cidr" to any port 443
done < "$file"
[ -f "$file" ] && rm -f "$file"
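Because the ranges can change over time (the cat-and-mouse problem noted above), you could save the script and re-run it periodically. The file name block-gptbot.sh and the weekly cron schedule below are illustrative choices, not part of the original script:

# Save the script as block-gptbot.sh, make it executable, and run it
chmod +x block-gptbot.sh
sudo ./block-gptbot.sh

# Verify that the deny rules were added
sudo ufw status numbered

# Optionally re-run it weekly from root's crontab (sudo crontab -e):
# 0 3 * * 0 /path/to/block-gptbot.sh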

 

Block Google AI (Bard and Vertex AI generative APIs)

Add the following two lines to your robots.txt:

User-agent: Google-Extended
Disallow: /

 

For more information, see Google's list of user agents used by its crawlers and fetchers. Note, however, that Google does not provide CIDR or IP address ranges, or autonomous system numbers (ASN), that you could use to block its AI bots with a web server firewall.

Block Common Crawl (CCBot) using robots.txt file

Add the following two lines to your robots.txt:

User-agent: CCBot
Disallow: /

 

Although Common Crawl is a non-profit foundation, many companies use the data collected by its bot, called CCBot, to train their AI models, so it is important to block it too. However, like Google, they do not provide CIDR or IP address ranges, or ASN information, that you could use to block the bot with a web server firewall.

Block Perplexity AI using robots.txt file

Perplexity is another service that takes your content and rewrites it using generative AI. You can block it as follows:

User-agent: PerplexityBot
Disallow: /

 

They have also published their IP address ranges, which you can block using a WAF or web server firewall.
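Putting it all together, a single robots.txt that blocks every AI crawler covered in this post would look like this:

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PerplexityBot
Disallow: /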

Can AI robots ignore my robots.txt file?

Well-known companies such as Google and OpenAI generally adhere to the robots.txt protocol, but some poorly designed AI bots will ignore your robots.txt.
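One way to spot bots that ignore robots.txt is to search your web server access logs for the user agent strings covered above. A minimal sketch, assuming an nginx log at the usual path (adjust for Apache or your distribution):

# List requests from known AI crawler user agents
grep -Ei 'GPTBot|ChatGPT-User|Google-Extended|CCBot|PerplexityBot' /var/log/nginx/access.log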

Is it possible to block AI bots using AWS or Cloudflare WAF technology?

Cloudflare recently announced a new firewall rule that blocks AI bots while still letting search engines and other legitimate bots access your website or blog through its WAF rules. Keep in mind that WAF products require a thorough understanding of how bots operate and must be implemented with care; otherwise you may end up blocking genuine users too. Here's how to block AI bots using the Cloudflare WAF:

[Figure: blocking AI crawler bots with a Cloudflare WAF rule]

Note that I'm still evaluating the Cloudflare solution, but my initial testing shows it challenges at least 3.31% of users. 3.31% is the challenge solve rate (CSR), i.e. the share of visitors who solved the CAPTCHA presented by Cloudflare. That is a very high rate, so I need to do more testing. I'll update this post when I start using Cloudflare.

 

Can I block access to code and documents hosted on GitHub and other cloud hosting sites?

No; I don't know of a way to do this.

I have concerns about using GitHub, as it is a Microsoft product, and Microsoft is OpenAI's biggest investor. They may use your data to train AI through terms-of-service updates and other loopholes. It is better for your company, or for you, to self-host a git server to prevent your data and code from being used for training. Large companies such as Apple have banned the internal use of ChatGPT and similar products because they fear it could leak code and sensitive data.

When AI is used to benefit humanity, is it ethical to prevent AI bots from accessing training data?

I'm skeptical about using OpenAI, Google Bard, Microsoft Bing, or any other artificial intelligence to benefit humanity. This seems to be just a money-making scheme while generative AI replaces white-collar jobs. But if you have any information on how my data can be used to cure cancer (or something similar), please feel free to share it in the comments section.

My personal opinion is that I am not benefiting from OpenAI/Google/Bing AI or any other artificial intelligence right now. I've worked hard for over 20 years, and I need to protect my work from these big tech companies profiting from it directly. You don't have to agree with me; you can hand over your code and other work to AI. Remember, this is optional. The only reason they offer robots.txt controls now is that multiple book authors and companies are suing them in court. On top of these problems, AI tools are also being used to create spam websites and e-books.

It’s true that AI already uses most of your data, but anything you create in the future can be protected through these technologies.

Summing up

As generative AI becomes more popular, content creators are beginning to question AI companies using data to train their models without permission. They profit from the code, text, images and videos created by millions of small independent creators, while depriving them of their source of income. Some may not object, but I know that such a sudden move would devastate many people. Therefore, website operators and content creators should be able to easily block unwanted AI crawlers. The process should be simple.

I will update this page as it becomes possible to block more robots via robots.txt and using cloud solutions provided by third parties such as Cloudflare.

