Search engines use robots (also called web spiders or bots) to crawl and index web pages. If your site or page is under construction or contains inappropriate content, you can prevent robots from crawling and indexing it. You can block entire sites, pages and links using robots.txt, or block specific pages and links using HTML tags. Read on to find out how to prevent certain bots from accessing your content.
Method 1 of 2: Blocking search engines with robots.txt
Step 1. Understand what a robots.txt file does
A robots.txt file is a plain text (ASCII) file that tells search engine spiders which parts of a site they may access. Well-behaved search robots will not crawl the files and folders that the robots.txt file disallows. Use a robots.txt file if:
- you want to hide certain content from search engines;
- you are in the process of developing a website and are not ready to be crawled and indexed by search engine spiders;
- you want to limit access to reputable bots only.
Step 2. Create and save a robots.txt file
To create a file, open a regular text or code editor. Save the file as robots.txt. The file name must be written in lowercase letters.
- Don't forget the "s" at the end of "robots".
- Select the extension “.txt” when saving the file. If you are using Word, select the "Plain Text" option.
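The file can also be written by a short script instead of an editor. The following is a minimal sketch in Python; the helper name write_robots_txt and its parameters are assumptions made for illustration, not part of any standard tool.

```python
from pathlib import Path


def write_robots_txt(directory, rules):
    """Save `rules` as a plain-text file named exactly 'robots.txt' in `directory`.

    Hypothetical helper: the file name is hard-coded in lowercase,
    because that is the name robots look for.
    """
    path = Path(directory) / "robots.txt"
    path.write_text(rules, encoding="utf-8")
    return path
```

Calling write_robots_txt(".", "User-agent: *\nDisallow: /\n") would create the file in the current folder.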
Step 3. Create a robots.txt file with an unconditional disallow directive
The unconditional disallow directive tells the robots of all major search engines to stay out of the entire site, so the site is not crawled or indexed. Add the following lines to the text file, one directive per line:
User-agent: *
Disallow: /
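One way to check a rule like this before uploading it is to run it through a parser. The following is a minimal sketch using Python's standard urllib.robotparser module. The helper name is_allowed and the example URLs are assumptions made for illustration, and real search engines may interpret rules somewhat differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# The unconditional disallow rule, one directive per line.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""


def is_allowed(rules, user_agent, url):
    """Return True if the robots.txt text `rules` lets `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(BLOCK_ALL, "googlebot", "https://www.yourdomain.com/"))               # False
print(is_allowed(BLOCK_ALL, "bingbot", "https://www.yourdomain.com/any/page.html"))    # False
```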
Step 4. Create a robots.txt file with a conditional allow directive
Instead of blocking all bots, consider blocking specific spiders from accessing certain parts of the site. The main commands of the conditional allow directive include:
- Blocking a specific bot: replace the asterisk next to User-agent with googlebot, googlebot-news, googlebot-image, bingbot or teoma.
- Blocking a directory or its contents:
User-agent: *
Disallow: /sample-directory/
- Blocking a web page:
User-agent: *
Disallow: /private_file.html
- Image blocking:
User-agent: googlebot-image
Disallow: /images_mypicture.jpg
- Blocking all images:
User-agent: googlebot-image
Disallow: /
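- Blocking a specific file format (here, GIF files; the leading "p" in the original pattern would have limited the rule to paths starting with /p):
User-agent: *
Disallow: /*.gif$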
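The effect of several rule groups on different bots can be tried out locally. The sketch below combines the directory, page and image-bot rules from this step and checks them with Python's standard urllib.robotparser. The names RULES, SITE and can_fetch, and the sample page names, are assumptions for illustration. Wildcard patterns are left out because this parser matches paths by simple prefix and, as far as is known, does not interpret "*" or "$" inside a path.

```python
from urllib.robotparser import RobotFileParser

# Rules for all bots, followed by a separate group for Google's image bot.
RULES = """\
User-agent: *
Disallow: /sample-directory/
Disallow: /private_file.html

User-agent: googlebot-image
Disallow: /
"""

SITE = "https://www.yourdomain.com"  # placeholder domain


def can_fetch(user_agent, path):
    """Return True if RULES lets `user_agent` fetch SITE + `path`."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return parser.can_fetch(user_agent, SITE + path)


print(can_fetch("googlebot", "/sample-directory/page.html"))  # False
print(can_fetch("googlebot", "/private_file.html"))           # False
print(can_fetch("googlebot", "/index.html"))                  # True
print(can_fetch("googlebot-image", "/index.html"))            # False
```

A bot that has its own group, such as googlebot-image here, follows that group rather than the general one.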
Step 5. Encourage bots to crawl and index your site
Many site owners do not want to block search engine spiders at all; they want their site to be fully indexed. There are three ways to achieve this. First, you can skip creating a robots.txt file: if a robot does not find one, it will crawl and index the entire site. Second, you can create an empty robots.txt file: the robot will find the file, see that it is empty, and continue crawling and indexing the site. Finally, you can create a robots.txt file with an unconditional allow directive, which is a Disallow line with no value:
User-agent: *
Disallow:
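That an empty file and an empty Disallow line both leave the site open can be confirmed with a parser. This is a small sketch using Python's standard urllib.robotparser; the helper name allows_everything, the sample paths and the placeholder domain are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def allows_everything(rules, sample_paths=("/", "/private_file.html", "/sample-directory/")):
    """Return True if `rules` lets a bot fetch every path in `sample_paths`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return all(
        parser.can_fetch("googlebot", "https://www.yourdomain.com" + path)
        for path in sample_paths
    )


print(allows_everything(""))                             # empty file: True
print(allows_everything("User-agent: *\nDisallow:\n"))   # empty Disallow: True
print(allows_everything("User-agent: *\nDisallow: /\n")) # block everything: False
```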
Step 6. Save the text file in the root directory of the domain
After editing your robots.txt file, save your changes. Upload the file to the root directory of the site. For example, if your domain is www.yourdomain.com, the file must be reachable at www.yourdomain.com/robots.txt.
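Because robots look for the file only at the root of the domain, its address can be derived from any page address on the site. A minimal sketch with Python's standard urllib.parse follows; the function name robots_txt_url and the example page URL are assumptions for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the root-level robots.txt URL for the site that hosts `page_url`."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.yourdomain.com/blog/post.html?page=2"))
# https://www.yourdomain.com/robots.txt
```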
Method 2 of 2: Blocking search engines with meta tags
Step 1. Learn about the HTML robots meta tag
The robots meta tag lets programmers set rules for bots or search engine spiders on a page-by-page basis. These tags can prevent bots from indexing a page or from following its links. They can also be used to block a specific search engine spider from indexing content. The tags are placed in the head section of the HTML file.
This method is usually used by programmers who do not have access to the site's root directory.
Step 2. Deny bots access to one page
You can prevent all bots from indexing a page, from following the links on it, or both. This tag is usually used while a site is under construction. Be sure to remove the tag once the site is finished; otherwise the page will not be indexed or appear in search results.
- Prevent bots from indexing the page and following any of the links:
<meta name="robots" content="noindex, nofollow">
- Prevent all bots from indexing the page:
<meta name="robots" content="noindex">
- Prevent all bots from following links on the page:
<meta name="robots" content="nofollow">
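To confirm which directives a page actually carries, its robots meta tag can be read back out of the HTML. The following is a minimal sketch built on Python's standard html.parser module; the class name RobotsMetaParser and the function robots_directives are assumptions for illustration, and the sample page is made up.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots" content="..."> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            # content is a comma-separated list such as "noindex, nofollow"
            for item in attr.get("content", "").split(","):
                if item.strip():
                    self.directives.add(item.strip().lower())


def robots_directives(html):
    """Return the set of robots meta directives found in `html`."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(sorted(robots_directives(page)))  # ['nofollow', 'noindex']
```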
Step 3. Allow bots to index the page, but not follow its links
Bots that are allowed to index the page will add it to the search index. If you prevent them from following links, they will not use the links on this page to reach other pages. Insert the following line of code into the head section:
<meta name="robots" content="index, nofollow">
Step 4. Allow the search engine spiders to follow the links, but not index the page
If you allow bots to follow links, the path from this page to the pages it links to remains open. If you prevent bots from indexing the page, the page itself will not appear in the index. Insert the following line of code into the head section:
<meta name="robots" content="noindex, follow">
Step 5. Block the outgoing link
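To hide a single link on a page, add the rel attribute with the value nofollow inside the link tag. Use it on links from other pages that lead to the specific page you want to block. For example, to hide a link to private_file.html:
<a href="private_file.html" rel="nofollow">Insert a link to the blocked page</a>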
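To see which links on a page are already marked this way, the page can be scanned for link tags whose rel attribute contains nofollow. This is a minimal sketch using Python's standard html.parser; the class name NofollowLinkParser, the function nofollow_links and the sample HTML are assumptions for illustration.

```python
from html.parser import HTMLParser


class NofollowLinkParser(HTMLParser):
    """Collect the href of every <a> tag whose rel attribute includes 'nofollow'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        # rel may hold several space-separated values, e.g. "noopener nofollow"
        rel_values = (attr.get("rel") or "").lower().split()
        if "nofollow" in rel_values and attr.get("href"):
            self.links.append(attr["href"])


def nofollow_links(html):
    """Return the hrefs of all nofollow links in `html`, in document order."""
    parser = NofollowLinkParser()
    parser.feed(html)
    return parser.links


sample = (
    '<a href="a.html" rel="nofollow">A</a>'
    '<a href="b.html">B</a>'
    '<a href="c.html" rel="noopener NoFollow">C</a>'
)
print(nofollow_links(sample))  # ['a.html', 'c.html']
```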
Step 6. Block a specific search spider
Instead of blocking the page for all bots, you can forbid only one bot from crawling and indexing it. To do this, replace the word "robots" in the name attribute of the meta tag with the name of a specific bot. Examples: googlebot, googlebot-news, googlebot-image, bingbot and teoma.
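Since the tag differs only in its name and content values, it can be generated rather than typed by hand. The sketch below builds the tag text in Python; the function name robots_meta_tag and its parameters are assumptions for illustration.

```python
from html import escape


def robots_meta_tag(index, follow, bot_name="robots"):
    """Build a robots meta tag for `bot_name` ('robots' addresses all bots)."""
    content = ", ".join([
        "index" if index else "noindex",
        "follow" if follow else "nofollow",
    ])
    # escape() guards against quotes or angle brackets in the bot name
    return f'<meta name="{escape(bot_name)}" content="{content}">'


print(robots_meta_tag(False, False, "googlebot"))
# <meta name="googlebot" content="noindex, nofollow">
print(robots_meta_tag(True, False))
# <meta name="robots" content="index, nofollow">
```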
Step 7. Encourage bots to crawl and index the page
If you want to make sure your page gets indexed and its links are followed, add the "robots" meta tag to the head section of the page. Use the following code:
<meta name="robots" content="index, follow">