Before setting up rules, let's go over the terminology. A robots.txt file lives at the root of a domain, so you can view or set a site's rules by appending /robots.txt to the domain address.

robots.txt directives
user-agent: The name of the crawler to which the rule applies.
disallow: Blocks the user agent from crawling a directory or page.
allow: Allows the user agent to crawl a directory or page. Please note that this directive only applies to Googlebot.
sitemap: Points to the sitemap, a file that lists the resources on the website.

Crawler names include Googlebot (Google) and Slurp (Yahoo).

robots.txt rules
Let's look at some common rules using robots.txt examples.

Block crawling
If you want to block crawlers from all content on your site, you can set it up as follows. As explained earlier, user-agent refers to a crawling bot, and * means the rule applies to every crawling bot. Disallow prevents the listed path from being crawled, and marking it with / blocks crawling of all pages, including the homepage, for all crawling bots.
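As a minimal sketch, a robots.txt file that blocks everything could look like this:

User-agent: *
Disallow: /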
Allow crawling
Conversely, there are also settings that allow crawling bots access. If Disallow has no / (that is, its value is left empty), crawling is allowed. In other words, all crawling bots may crawl every page, including the homepage.
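A sketch of such a file, again using the directives described above (an empty Disallow value blocks nothing):

User-agent: *
Disallow: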
Block specific crawlers – specific folders
If you want to block a specific crawler from a particular folder, you can do that too. Googlebot is the name of Google's crawling bot; to target it, just enter the bot's name in user-agent. If you only want to block crawling in a specific location, list that path in Disallow, which blocks crawling of any page whose URL falls under it. It is also possible to block crawling of multiple directories at the same time; a file that lists both a calendar and a junk directory, for example, keeps crawlers out of both.
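As a sketch of these combined rules (the /nogooglebot/ directory name is only a placeholder; /calendar/ and /junk/ are the directories mentioned above):

User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Disallow: /calendar/
Disallow: /junk/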
Block specific crawlers – specific web pages
You can also block crawlers from accessing individual web pages. With this kind of setting, you can block Naver's crawler, Yeti, from crawling a specific page.
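For instance, a rule like the following would keep Yeti away from a single page (the /private-page.html path is purely hypothetical):

User-agent: Yeti
Disallow: /private-page.html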
You can also apply the settings introduced above at the same time.

Allow all crawlers except one
In this case, only Unnecessarybot is blocked, and the remaining crawlers can still access the site.
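A sketch of that configuration (Unnecessarybot is the bot named in the example; the / blocks it from the entire site, while the explicit empty Disallow for * makes clear that everyone else is allowed):

User-agent: Unnecessarybot
Disallow: /

User-agent: *
Disallow: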
Allow only one crawler
Conversely, if you want to allow only one crawler, you can configure that as well. With this setting, only Googlebot-news is allowed, and access for all other crawlers is blocked.
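A sketch of that configuration (Allow: / explicitly permits Googlebot-news to crawl everything, while every other bot is blocked from the whole site):

User-agent: Googlebot-news
Allow: /

User-agent: *
Disallow: /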