{"id":2376,"date":"2024-12-13T10:43:06","date_gmt":"2024-12-13T10:43:06","guid":{"rendered":"https:\/\/yolohive.com\/?p=2376"},"modified":"2024-12-13T10:43:16","modified_gmt":"2024-12-13T10:43:16","slug":"how-to-block-ai-crawler-bots-using-robots-txt-file","status":"publish","type":"post","link":"https:\/\/yolohive.com\/how-to-block-ai-crawler-bots-using-robots-txt-file\/","title":{"rendered":"How to block AI Crawler Bots using robots.txt file"},"content":{"rendered":"\n
As a content creator or blog author, your unique, high-quality content is your asset. But have you noticed that some generative AI platforms, such as OpenAI and CCBot, might be using your work to train their algorithms\u2014without your consent?<\/p>\n\n\n\n
You don\u2019t have to worry! By using a simple file called robots.txt<\/code>, you can block these AI crawlers from accessing your website or blog.<\/p>\n\n\n\n
What is a robots.txt file?<\/h2>\n\n\n\n
The robots.txt<\/code> file is a tool that allows website owners to manage how search engine crawlers interact with their content. It gives you the power to disallow specific bots from crawling your site, ensuring greater control over your content.<\/p>\n\n\n\n
The syntax below shows how to block a single bot using a user-agent:<\/p>\n\n\n\n
user-agent: {BOT-NAME-HERE}\ndisallow: \/<\/code><\/pre><\/div>\n\n\n\n
The following shows how to allow specific bots to crawl your website using a user-agent:<\/p>\n\n\n\n
User-agent: {BOT-NAME-HERE}\nAllow: \/<\/code><\/pre><\/div>\n\n\n\n
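Before deploying rules like these, you can sanity-check them locally. The sketch below is only an illustration (the bot names and URL are examples, not prescriptions); it uses Python's standard urllib.robotparser module to confirm that a disallowed bot is refused while everyone else is still allowed:

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: block CCBot everywhere, allow all other bots.
rules = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# CCBot is blocked from every path; other agents fall through to "*".
print(parser.can_fetch("CCBot", "https://example.com/post/1"))      # False
print(parser.can_fetch("Googlebot", "https://example.com/post/1"))  # True
```

This mirrors how compliant crawlers interpret your file, so it catches typos in directives before they reach production.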
Where to place your robots.txt file?<\/h2>\n\n\n\n
Upload the file to your website\u2019s root folder:<\/p>\n\n\n\n
https:\/\/example.com\/robots.txt\nhttps:\/\/blog.example.com\/robots.txt<\/code><\/pre><\/div>\n\n\n\n
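Note that each host, including each subdomain, serves its own robots.txt from its root, as the two URLs above show. As a small illustration (the helper function name is hypothetical), you can derive the expected location from any page URL with Python's standard urllib.parse:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    """robots.txt always lives at the root of the host serving the page."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("https://blog.example.com/post/123"))
# https://blog.example.com/robots.txt
```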
Learn More About robots.txt<\/code><\/h2>\n\n\n\n
If you\u2019re ready to take control of your website\u2019s accessibility, dive deeper into the details of robots.txt<\/code> with these helpful resources:<\/p>\n\n\n\n
\n
Learn how robots.txt<\/code> works and how to configure it effectively for your site.<\/li>\n\n\n\n
Understand the role of robots.txt<\/code> in managing web crawler access.<\/li>\n<\/ul>\n\n\n\n
How to block AI crawler bots using the robots.txt file<\/h2>\n\n\n\n
The syntax:<\/p>\n\n\n\n
user-agent: {AI-Crawlers-Bot-Name-Here}\ndisallow: \/<\/code><\/pre><\/div>\n\n\n\n
Blocking Google AI (Bard and Vertex AI generative APIs)<\/h2>\n\n\n\n
Add the following two lines to your robots.txt:<\/p>\n\n\n\n
User-agent: Google-Extended\nDisallow: \/<\/code><\/pre><\/div>\n\n\n\n
For more information about managing crawlers, you can review the list of user agents<\/a><\/strong> used by Google crawlers and fetchers. This can help you identify legitimate Google bots accessing your site.<\/p>\n\n\n\n
Additional Information on User Agents and AI Bots<\/h3>\n\n\n\n
However, it\u2019s important to note that:<\/p>\n\n\n\n
\n
The robots.txt<\/code> file remains one of the most effective methods to guide compliant crawlers and restrict access to your content.<\/li>\n<\/ul>\n\n\n\n
For advanced control, monitor your server logs for unusual activity and configure additional security measures, such as rate limiting or IP blocking, to complement your robots.txt<\/code> directives.<\/p>\n\n\n\n
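As a starting point for that kind of log monitoring, here is a minimal, illustrative sketch. The bot list and sample log lines are made up for the example; real access-log paths and formats vary by server. It counts requests whose User-Agent field mentions a known AI crawler:

```python
from collections import Counter

# User-agent substrings of AI crawlers discussed in this article.
AI_BOTS = ("GPTBot", "ChatGPT-User", "CCBot", "Google-Extended")

def count_ai_bot_hits(log_lines):
    """Count requests per AI bot based on the User-Agent field."""
    hits = Counter()
    for line in log_lines:
        for bot in AI_BOTS:
            if bot in line:
                hits[bot] += 1
    return hits

# Hypothetical combined-format access-log lines.
sample = [
    '1.2.3.4 - - [13/Dec/2024] "GET / HTTP/1.1" 200 "-" "Mozilla/5.0 GPTBot/1.0"',
    '5.6.7.8 - - [13/Dec/2024] "GET /post HTTP/1.1" 200 "-" "CCBot/2.0"',
    '9.9.9.9 - - [13/Dec/2024] "GET /about HTTP/1.1" 200 "-" "Mozilla/5.0"',
]
print(count_ai_bot_hits(sample))  # e.g. Counter({'GPTBot': 1, 'CCBot': 1})
```

A spike in hits from one of these agents after you have disallowed it in robots.txt is a sign the bot is not complying, and a cue to escalate to rate limiting or IP blocking.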
Blocking OpenAI using the robots.txt file<\/h2>\n\n\n\n
Add the following four lines to your robots.txt:<\/p>\n\n\n\n
User-agent: GPTBot\nDisallow: \/\nUser-agent: ChatGPT-User\nDisallow: \/<\/code><\/pre><\/div>\n\n\n\n
OpenAI utilizes two distinct user agents for its operations: one for web crawling and another for browsing, each associated with unique CIDR and IP address ranges. Configuring firewall rules to block these requires an advanced understanding of networking concepts and root-level access to a Linux server.<\/p>\n\n\n\n
If you\u2019re not familiar with these technical aspects, such as managing CIDR ranges or configuring firewalls, it\u2019s advisable to seek the assistance of a Linux system administrator. Keep in mind that OpenAI\u2019s IP address ranges are subject to change, which can turn this process into an ongoing effort to keep up with updates\u2014a game of cat and mouse.<\/p>\n\n\n\n
Below is a list<\/a><\/strong> of user agents used by OpenAI\u2019s crawlers and fetchers, along with their associated CIDR or IP address ranges.<\/p>\n\n\n\n
1: The\u00a0ChatGPT-User is used by\u00a0plugins<\/strong>\u00a0in ChatGPT<\/h3>\n\n\n\n
To block OpenAI\u2019s plugin AI bot, you can configure your web server firewall to restrict access from specific IP ranges, such as 23.98.142.176\/28<\/code>.<\/p>\n\n\n\n
Here\u2019s an example of how to block a CIDR or IP range using the ufw<\/code> command or iptables<\/code> on your server:<\/p>\n\n\n\n
Using UFW:<\/strong><\/p>\n\n\n\n
sudo ufw deny from 23.98.142.176\/28 <\/code><\/pre><\/div>\n\n\n\n
Using iptables:<\/strong><\/p>\n\n\n\n
sudo iptables -A INPUT -s 23.98.142.176\/28 -j DROP<\/code><\/pre><\/div>\n\n\n\n
These commands prevent any traffic originating from the specified IP range from accessing your server. Make sure to review and update your firewall rules periodically to account for changes in OpenAI\u2019s IP ranges. If you\u2019re unfamiliar with configuring firewalls, consider enlisting the help of a Linux system administrator.<\/p>\n\n\n\n
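If you want to see exactly which addresses a CIDR rule covers before applying it, Python's standard ipaddress module can expand the range. A quick sketch using the 23.98.142.176/28 range from the firewall examples above (the probe addresses are arbitrary):

```python
import ipaddress

# A /28 covers 16 addresses: here, 23.98.142.176 through 23.98.142.191.
block = ipaddress.ip_network("23.98.142.176/28")

print(block.num_addresses)                             # 16
print(ipaddress.ip_address("23.98.142.180") in block)  # True
print(ipaddress.ip_address("23.98.143.1") in block)    # False
```

Checking a range this way helps you confirm a deny rule is neither narrower nor broader than you intend before it goes into ufw or iptables.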
2: The\u00a0GPTBot\u00a0is used by ChatGPT<\/h3>\n\n\n\n